
Commit 589d88d

ENH: concat and append now can handle unordered categories

1 parent 8023029 commit 589d88d

File tree

11 files changed: +653 -183 lines changed

doc/source/categorical.rst

+50-5
@@ -675,12 +675,57 @@ be lexsorted, use ``sort_categories=True`` argument.
 
     union_categoricals([a, b], sort_categories=True)
 
-.. note::
+``union_categoricals`` also works with the "easy" case of combining two
+categoricals of the same categories and order information
+(e.g. what you could also ``append`` for).
+
+.. ipython:: python
+
+    a = pd.Categorical(["a", "b"], ordered=True)
+    b = pd.Categorical(["a", "b", "a"], ordered=True)
+    union_categoricals([a, b])
+
+The following raises ``TypeError`` because the categories are ordered and not identical.
+
+.. code-block:: ipython
+
+    In [1]: a = pd.Categorical(["a", "b"], ordered=True)
+    In [2]: b = pd.Categorical(["a", "b", "c"], ordered=True)
+    In [3]: union_categoricals([a, b])
+    Out[3]:
+    TypeError: to union ordered Categoricals, all categories must be the same
+
+.. _categorical.concat:
+
+Concatenation
+~~~~~~~~~~~~~
+
+This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects<merging.concat>` for a general description.
+
+By default, concatenating ``Series`` or ``DataFrame`` which contain different
+categories results in ``object`` dtype.
+
+.. ipython:: python
+
+    s1 = pd.Series(['a', 'b'], dtype='category')
+    s2 = pd.Series(['b', 'c'], dtype='category')
+    pd.concat([s1, s2])
+
+Specifying ``union_categoricals=True`` allows concatenating the categories following
+the ``union_categoricals`` rule.
+
+.. ipython:: python
+
+    pd.concat([s1, s2], union_categoricals=True)
+
+The following table summarizes the results of ``Categoricals``-related concatenations.
 
-    In addition to the "easy" case of combining two categoricals of the same
-    categories and order information (e.g. what you could also ``append`` for),
-    ``union_categoricals`` only works with unordered categoricals and will
-    raise if any are ordered.
+| arg1     | arg2                                                    | default                    | ``union_categoricals=True`` |
+|----------|---------------------------------------------------------|----------------------------|------------------------------|
+| category | category (identical categories)                         | category                   | category                     |
+| category | category (different categories, both not ordered)       | object (dtype is inferred) | category                     |
+| category | category (different categories, either one is ordered)  | object (dtype is inferred) | object (dtype is inferred)   |
+| category | not category                                            | object (dtype is inferred) | object (dtype is inferred)   |
 
 Getting Data In/Out
 -------------------
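
As a quick reading aid for the table above: under the default behaviour only identical categories keep ``category`` dtype, everything else falls back to ``object``. A minimal sketch of the default rows (no new keyword involved; dtypes as stated by the table itself):

    import pandas as pd

    s1 = pd.Series(["a", "b"], dtype="category")
    s2 = pd.Series(["b", "a"], dtype="category")   # same inferred categories ['a', 'b']
    s3 = pd.Series(["b", "c"], dtype="category")   # different categories
    s4 = pd.Series(["b", "c"])                     # not category

    print(pd.concat([s1, s2]).dtype)  # category (identical categories)
    print(pd.concat([s1, s3]).dtype)  # object   (different categories, default)
    print(pd.concat([s1, s4]).dtype)  # object   (category + not category)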

doc/source/merging.rst

+19-13
@@ -78,34 +78,40 @@ some configurable handling of "what to do with the other axes":
 
 ::
 
     pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
-              keys=None, levels=None, names=None, verify_integrity=False)
+              keys=None, levels=None, names=None, verify_integrity=False,
+              union_categoricals=False, copy=True)
 
-- ``objs``: a sequence or mapping of Series, DataFrame, or Panel objects. If a
+- ``objs`` : a sequence or mapping of Series, DataFrame, or Panel objects. If a
   dict is passed, the sorted keys will be used as the `keys` argument, unless
   it is passed, in which case the values will be selected (see below). Any None
   objects will be dropped silently unless they are all None in which case a
   ValueError will be raised.
-- ``axis``: {0, 1, ...}, default 0. The axis to concatenate along.
-- ``join``: {'inner', 'outer'}, default 'outer'. How to handle indexes on
+- ``axis`` : {0, 1, ...}, default 0. The axis to concatenate along.
+- ``join`` : {'inner', 'outer'}, default 'outer'. How to handle indexes on
   other axis(es). Outer for union and inner for intersection.
-- ``join_axes``: list of Index objects. Specific indexes to use for the other
+- ``ignore_index`` : boolean, default False. If True, do not use the index
+  values on the concatenation axis. The resulting axis will be labeled 0, ...,
+  n - 1. This is useful if you are concatenating objects where the
+  concatenation axis does not have meaningful indexing information. Note
+  the index values on the other axes are still respected in the join.
+- ``join_axes`` : list of Index objects. Specific indexes to use for the other
   n - 1 axes instead of performing inner/outer set logic.
-- ``keys``: sequence, default None. Construct hierarchical index using the
+- ``keys`` : sequence, default None. Construct hierarchical index using the
   passed keys as the outermost level. If multiple levels passed, should
   contain tuples.
 - ``levels`` : list of sequences, default None. Specific levels (unique values)
   to use for constructing a MultiIndex. Otherwise they will be inferred from the
   keys.
-- ``names``: list, default None. Names for the levels in the resulting
+- ``names`` : list, default None. Names for the levels in the resulting
   hierarchical index.
-- ``verify_integrity``: boolean, default False. Check whether the new
+- ``verify_integrity`` : boolean, default False. Check whether the new
   concatenated axis contains duplicates. This can be very expensive relative
   to the actual data concatenation.
-- ``ignore_index`` : boolean, default False. If True, do not use the index
-  values on the concatenation axis. The resulting axis will be labeled 0, ...,
-  n - 1. This is useful if you are concatenating objects where the
-  concatenation axis does not have meaningful indexing information. Note
-  the index values on the other axes are still respected in the join.
+- ``union_categoricals`` : boolean, default False.
+  If True, use the ``union_categoricals`` rule to concat ``category`` dtype.
+  If False, ``category`` dtype is kept if both categories are identical,
+  otherwise the result is ``object`` dtype.
+  See :ref:`Categoricals Concatenation<categorical.concat>` for details.
 - ``copy`` : boolean, default True. If False, do not copy data unnecessarily.
 
 Without a little bit of context and example many of these arguments don't make
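
To make the new entry in the argument list concrete, here is a hedged sketch of how the keyword combines with an existing option such as ``ignore_index``; the ``union_categoricals=True`` call assumes the signature proposed by this commit and is not part of released pandas:

    import pandas as pd

    s1 = pd.Series(["a", "b"], dtype="category")
    s2 = pd.Series(["b", "c"], dtype="category")

    # default: differing categories fall back to object dtype
    out_default = pd.concat([s1, s2], ignore_index=True)

    # with the keyword added in this commit, categories are unioned and
    # the result keeps category dtype (per the proposed behaviour)
    out_union = pd.concat([s1, s2], ignore_index=True, union_categoricals=True)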

doc/source/whatsnew/v0.19.0.txt

+33-1
@@ -15,6 +15,8 @@ Highlights include:
 
 - :func:`merge_asof` for asof-style time-series joining, see :ref:`here <whatsnew_0190.enhancements.asof_merge>`
 - ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
+- :func:`read_csv` now supports parsing ``Categorical`` data, see :ref:`here <whatsnew_0190.enhancements.read_csv_categorical>`
+- A function :func:`union_categoricals` has been added for combining categoricals, see :ref:`here <whatsnew_0190.enhancements.union_categoricals>`
 - pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
 - ``PeriodIndex`` now has its own ``period`` dtype, and changed to be more consistent with other ``Index`` classes. See :ref:`here <whatsnew_0190.api.period>`
 - Sparse data structures now gained enhanced support of ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>`
@@ -277,6 +279,37 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification
     df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
     df['col3']
 
+.. _whatsnew_0190.enhancements.union_categoricals:
+
+Categorical Concatenation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- A function :func:`union_categoricals` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`, :issue:`13763`, :issue:`13846`)
+
+  .. ipython:: python
+
+      from pandas.types.concat import union_categoricals
+      a = pd.Categorical(["b", "c"])
+      b = pd.Categorical(["a", "b"])
+      union_categoricals([a, b])
+
+- ``concat`` and ``append`` can now concatenate unordered ``category`` dtypes using ``union_categoricals`` internally (:issue:`13524`)
+
+  By default, different categories result in ``object`` dtype.
+
+  .. ipython:: python
+
+      s1 = pd.Series(['a', 'b'], dtype='category')
+      s2 = pd.Series(['b', 'c'], dtype='category')
+      pd.concat([s1, s2])
+
+  Specifying ``union_categoricals=True`` allows concatenating the categories following
+  the ``union_categoricals`` rule.
+
+  .. ipython:: python
+
+      pd.concat([s1, s2], union_categoricals=True)
+
 .. _whatsnew_0190.enhancements.semi_month_offsets:
 
 Semi-Month Offsets
@@ -448,7 +481,6 @@ Other enhancements
 - ``DataFrame`` has gained the ``.asof()`` method to return the last non-NaN values according to the selected subset (:issue:`13358`)
 - The ``DataFrame`` constructor will now respect key ordering if a list of ``OrderedDict`` objects are passed in (:issue:`13304`)
 - ``pd.read_html()`` has gained support for the ``decimal`` option (:issue:`12907`)
-- A function :func:`union_categorical` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`, :issue:`:13763`, :issue:`13846`)
 - ``Series`` has gained the properties ``.is_monotonic``, ``.is_monotonic_increasing``, ``.is_monotonic_decreasing``, similar to ``Index`` (:issue:`13336`)
 - ``DataFrame.to_sql()`` now allows a single value as the SQL type for all columns (:issue:`11886`).
 - ``Series.append`` now supports the ``ignore_index`` option (:issue:`13677`)
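
The whatsnew entry above only exercises ``pd.concat``; the same keyword is threaded through ``Series.append`` and ``DataFrame.append`` elsewhere in this commit. A short sketch of the ``append`` side, assuming the proposed keyword:

    import pandas as pd

    s1 = pd.Series(["a", "b"], dtype="category")
    s2 = pd.Series(["b", "c"], dtype="category")

    # without the flag the differing categories would give an object result;
    # with this commit's keyword the categories are unioned instead
    combined = s1.append(s2, ignore_index=True, union_categoricals=True)
    # expected under this change: category dtype with categories ['a', 'b', 'c']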

pandas/core/frame.py

+8-2
@@ -4322,7 +4322,8 @@ def infer(x):
     # ----------------------------------------------------------------------
     # Merging / joining methods
 
-    def append(self, other, ignore_index=False, verify_integrity=False):
+    def append(self, other, ignore_index=False, verify_integrity=False,
+               union_categoricals=False):
         """
         Append rows of `other` to the end of this frame, returning a new
         object. Columns not in this frame are added as new columns.
@@ -4335,6 +4336,10 @@ def append(self, other, ignore_index=False, verify_integrity=False):
             If True, do not use the index labels.
         verify_integrity : boolean, default False
             If True, raise ValueError on creating index with duplicates.
+        union_categoricals : bool, default False
+            If True, use union_categoricals rule to concat category dtype.
+            If False, category dtype is kept if both categories are identical,
+            otherwise results in object dtype.
 
         Returns
         -------
@@ -4411,7 +4416,8 @@ def append(self, other, ignore_index=False, verify_integrity=False):
         else:
             to_concat = [self, other]
         return concat(to_concat, ignore_index=ignore_index,
-                      verify_integrity=verify_integrity)
+                      verify_integrity=verify_integrity,
+                      union_categoricals=union_categoricals)
 
     def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
              sort=False):
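
To ground the docstring addition: the default ``union_categoricals=False`` keeps today's behaviour (identical categories stay ``category``, differing ones degrade to ``object``), while ``True`` opts into the union rule. A sketch against this commit's proposed ``DataFrame.append`` signature; the ``grade``/``score`` columns are purely illustrative:

    import pandas as pd

    df1 = pd.DataFrame({"grade": pd.Categorical(["a", "b"]), "score": [1, 2]})
    df2 = pd.DataFrame({"grade": pd.Categorical(["b", "c"]), "score": [3, 4]})

    # default (unchanged behaviour): differing categories -> object column
    default = df1.append(df2, ignore_index=True)

    # with this commit's keyword the grade column keeps category dtype and
    # its categories become the union ['a', 'b', 'c']; score is unaffected
    unioned = df1.append(df2, ignore_index=True, union_categoricals=True)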

pandas/core/internals.py

+16-9
@@ -1144,7 +1144,7 @@ def get_result(other):
             return self._try_coerce_result(result)
 
         # error handler if we have an issue operating with the function
-        def handle_error():
+        def handle_error(detail):
 
             if raise_on_error:
                 raise TypeError('Could not operate %s with block values %s' %
@@ -1165,7 +1165,7 @@ def handle_error():
         except ValueError as detail:
             raise
         except Exception as detail:
-            result = handle_error()
+            result = handle_error(detail)
 
         # technically a broadcast error in numpy can 'work' by returning a
         # boolean False
@@ -4771,7 +4771,8 @@ def _putmask_smart(v, m, n):
     return nv
 
 
-def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
+def concatenate_block_managers(mgrs_indexers, axes, concat_axis,
+                               copy, union_categoricals=False):
     """
     Concatenate block managers into one.
 
@@ -4781,16 +4782,20 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
     axes : list of Index
     concat_axis : int
     copy : bool
+    union_categoricals : bool, default False
+        If True, use union_categoricals rule to concat CategoricalBlock.
+        If False, CategoricalBlock is kept if both categories are
+        identical, otherwise results in ObjectBlock.
 
     """
     concat_plan = combine_concat_plans(
        [get_mgr_concatenation_plan(mgr, indexers)
         for mgr, indexers in mgrs_indexers], concat_axis)
 
-    blocks = [make_block(concatenate_join_units(join_units, concat_axis,
-                                                copy=copy),
-                         placement=placement)
-              for placement, join_units in concat_plan]
+    blocks = [make_block(
+        concatenate_join_units(join_units, concat_axis, copy=copy,
+                               union_categoricals=union_categoricals),
+        placement=placement) for placement, join_units in concat_plan]
 
     return BlockManager(blocks, axes)
 
@@ -4875,7 +4880,8 @@ def get_empty_dtype_and_na(join_units):
     raise AssertionError("invalid dtype determination in get_concat_dtype")
 
 
-def concatenate_join_units(join_units, concat_axis, copy):
+def concatenate_join_units(join_units, concat_axis, copy,
+                           union_categoricals=False):
     """
     Concatenate values from several join units along selected axis.
     """
@@ -4895,7 +4901,8 @@ def concatenate_join_units(join_units, concat_axis, copy):
     if copy and concat_values.base is not None:
         concat_values = concat_values.copy()
     else:
-        concat_values = _concat._concat_compat(to_concat, axis=concat_axis)
+        concat_values = _concat._concat_compat(
+            to_concat, axis=concat_axis, union_categoricals=union_categoricals)
 
     return concat_values
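
The block-manager changes above only thread the flag down to ``_concat_compat``; the category handling itself follows the ``union_categoricals`` rule. A sketch of that rule at the values level, using the public function this commit builds on (imported from ``pandas.types.concat``, the 0.19-era path shown in the whatsnew):

    import pandas as pd
    from pandas.types.concat import union_categoricals  # 0.19-era import path

    a = pd.Categorical(["b", "c"])
    b = pd.Categorical(["a", "b"])

    # unordered inputs: values are preserved and the categories are unioned
    # in order of appearance -> ['b', 'c', 'a']
    merged = union_categoricals([a, b])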

pandas/core/series.py

+8-2
@@ -1525,7 +1525,8 @@ def searchsorted(self, v, side='left', sorter=None):
     # -------------------------------------------------------------------
     # Combination
 
-    def append(self, to_append, ignore_index=False, verify_integrity=False):
+    def append(self, to_append, ignore_index=False, verify_integrity=False,
+               union_categoricals=False):
         """
         Concatenate two or more Series.
@@ -1539,6 +1540,10 @@ def append(self, to_append, ignore_index=False, verify_integrity=False):
 
         verify_integrity : boolean, default False
             If True, raise Exception on creating index with duplicates
+        union_categoricals : bool, default False
+            If True, use union_categoricals rule to concat category dtype.
+            If False, category dtype is kept if both categories are identical,
+            otherwise results in object dtype.
 
         Returns
         -------
@@ -1592,7 +1597,8 @@ def append(self, to_append, ignore_index=False, verify_integrity=False):
         else:
             to_concat = [self, to_append]
         return concat(to_concat, ignore_index=ignore_index,
-                      verify_integrity=verify_integrity)
+                      verify_integrity=verify_integrity,
+                      union_categoricals=union_categoricals)
 
     def _binop(self, other, func, level=None, fill_value=None):
         """

pandas/tests/series/test_combine_concat.py

+2-2
@@ -185,9 +185,9 @@ def test_concat_empty_series_dtypes(self):
                          'category')
         self.assertEqual(pd.concat([Series(dtype='category'),
                                     Series(dtype='float64')]).dtype,
-                         np.object_)
+                         'float64')
         self.assertEqual(pd.concat([Series(dtype='category'),
-                                    Series(dtype='object')]).dtype, 'category')
+                                    Series(dtype='object')]).dtype, 'object')
 
         # sparse
         result = pd.concat([Series(dtype='float64').to_sparse(), Series(
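
The two changed assertions encode the revised dtype inference for empty series: an empty categorical concatenated with a ``float64`` series now yields ``float64``, and with an ``object`` series yields ``object``. A sketch mirroring those assertions (results as asserted by this commit, not necessarily by other pandas versions):

    import pandas as pd
    from pandas import Series

    # per the updated test, an empty categorical no longer forces the result dtype
    assert pd.concat([Series(dtype='category'), Series(dtype='float64')]).dtype == 'float64'
    assert pd.concat([Series(dtype='category'), Series(dtype='object')]).dtype == 'object'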

0 commit comments
