Commit d5ffb1f

jonmmease authored and jreback committed
Support merging DataFrames on a combo of columns and index levels (GH 14355) (pandas-dev#17484)
1 parent d74ac70 commit d5ffb1f

10 files changed: +1138 -38 lines changed

doc/source/merging.rst

+62 -6
@@ -518,14 +518,16 @@ standard database join operations between DataFrame objects:

 - ``left``: A DataFrame object
 - ``right``: Another DataFrame object
-- ``on``: Columns (names) to join on. Must be found in both the left and
-  right DataFrame objects. If not passed and ``left_index`` and
+- ``on``: Column or index level names to join on. Must be found in both the left
+  and right DataFrame objects. If not passed and ``left_index`` and
   ``right_index`` are ``False``, the intersection of the columns in the
   DataFrames will be inferred to be the join keys
-- ``left_on``: Columns from the left DataFrame to use as keys. Can either be
-  column names or arrays with length equal to the length of the DataFrame
-- ``right_on``: Columns from the right DataFrame to use as keys. Can either be
-  column names or arrays with length equal to the length of the DataFrame
+- ``left_on``: Columns or index levels from the left DataFrame to use as
+  keys. Can either be column names, index level names, or arrays with length
+  equal to the length of the DataFrame
+- ``right_on``: Columns or index levels from the right DataFrame to use as
+  keys. Can either be column names, index level names, or arrays with length
+  equal to the length of the DataFrame
 - ``left_index``: If ``True``, use the index (row labels) from the left
   DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex
   (hierarchical), the number of levels must match the number of join keys
@@ -563,6 +565,10 @@ standard database join operations between DataFrame objects:

   .. versionadded:: 0.21.0

+.. note::
+
+   Support for specifying index levels as the ``on``, ``left_on``, and
+   ``right_on`` parameters was added in version 0.22.0.

 The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
 and ``right`` is a subclass of DataFrame, the return type will still be
@@ -1121,6 +1127,56 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
           labels=['left', 'right'], vertical=False);
    plt.close('all');

+.. _merging.merge_on_columns_and_levels:
+
+Merging on a combination of columns and index levels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.22
+
+Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
+may refer to either column names or index level names. This enables merging
+``DataFrame`` instances on a combination of index levels and columns without
+resetting indexes.
+
+.. ipython:: python
+
+   left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
+
+   left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+                        'B': ['B0', 'B1', 'B2', 'B3'],
+                        'key2': ['K0', 'K1', 'K0', 'K1']},
+                       index=left_index)
+
+   right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
+
+   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
+                         'D': ['D0', 'D1', 'D2', 'D3'],
+                         'key2': ['K0', 'K0', 'K0', 'K1']},
+                        index=right_index)
+
+   result = left.merge(right, on=['key1', 'key2'])
+
+.. ipython:: python
+   :suppress:
+
+   @savefig merge_on_index_and_column.png
+   p.plot([left, right], result,
+          labels=['left', 'right'], vertical=False);
+   plt.close('all');
+
+.. note::
+
+   When DataFrames are merged on a string that matches an index level in both
+   frames, the index level is preserved as an index level in the resulting
+   DataFrame.
+
+.. note::
+
+   If a string matches both a column name and an index level name, then a
+   warning is issued and the column takes precedence. This will result in an
+   ambiguity error in a future version.
+
 Overlapping value columns
 ~~~~~~~~~~~~~~~~~~~~~~~~~

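The behaviour described in the new documentation section can be compared with the older reset-and-merge workaround. The following is a minimal sketch, assuming a pandas release that includes this change (0.22.0 or later, per the note in the diff). The names `result` and `expected` are illustrative and not part of the commit. It merges on the `key1` index level and the `key2` column directly, then builds the same rows by resetting the index, merging on plain columns, and restoring the index.

```python
import pandas as pd

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"],
     "B": ["B0", "B1", "B2", "B3"],
     "key2": ["K0", "K1", "K0", "K1"]},
    index=pd.Index(["K0", "K0", "K1", "K2"], name="key1"),
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"],
     "D": ["D0", "D1", "D2", "D3"],
     "key2": ["K0", "K0", "K0", "K1"]},
    index=pd.Index(["K0", "K1", "K2", "K2"], name="key1"),
)

# 'key1' is an index level in both frames; 'key2' is a column in both.
result = left.merge(right, on=["key1", "key2"])

# Workaround that avoids index-level names in `on`:
# move the index into a column, merge on columns, then restore the index.
expected = (
    left.reset_index()
    .merge(right.reset_index(), on=["key1", "key2"])
    .set_index("key1")
)

print(result)
print(result.values.tolist() == expected.values.tolist())
```

Because `key1` is an index level in both inputs, the merged frame should keep it as its index, which matches the first note in the documentation hunk.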
doc/source/whatsnew/v0.22.0.txt

+31
@@ -32,6 +32,37 @@ The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtyp
    pd.get_dummies(df, columns=['c'], dtype=bool).dtypes


+.. _whatsnew_0220.enhancements.merge_on_columns_and_levels:
+
+Merging on a combination of columns and index levels
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Strings passed to :meth:`DataFrame.merge` as the ``on``, ``left_on``, and ``right_on``
+parameters may now refer to either column names or index level names.
+This enables merging ``DataFrame`` instances on a combination of index levels
+and columns without resetting indexes. See the :ref:`Merge on columns and
+levels <merging.merge_on_columns_and_levels>` documentation section.
+(:issue:`14355`)
+
+.. ipython:: python
+
+   left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
+
+   left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+                        'B': ['B0', 'B1', 'B2', 'B3'],
+                        'key2': ['K0', 'K1', 'K0', 'K1']},
+                       index=left_index)
+
+   right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
+
+   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
+                         'D': ['D0', 'D1', 'D2', 'D3'],
+                         'key2': ['K0', 'K0', 'K0', 'K1']},
+                        index=right_index)
+
+   left.merge(right, on=['key1', 'key2'])
+
+
 .. _whatsnew_0220.enhancements.other:

 Other Enhancements

pandas/core/frame.py

+23 -14
@@ -148,16 +148,17 @@
     * inner: use intersection of keys from both frames, similar to a SQL inner
       join; preserve the order of the left keys
 on : label or list
-    Field names to join on. Must be found in both DataFrames. If on is
-    None and not merging on indexes, then it merges on the intersection of
-    the columns by default.
+    Column or index level names to join on. These must be found in both
+    DataFrames. If `on` is None and not merging on indexes then this defaults
+    to the intersection of the columns in both DataFrames.
 left_on : label or list, or array-like
-    Field names to join on in left DataFrame. Can be a vector or list of
-    vectors of the length of the DataFrame to use a particular vector as
-    the join key instead of columns
+    Column or index level names to join on in the left DataFrame. Can also
+    be an array or list of arrays of the length of the left DataFrame.
+    These arrays are treated as if they are columns.
 right_on : label or list, or array-like
-    Field names to join on in right DataFrame or vector/list of vectors per
-    left_on docs
+    Column or index level names to join on in the right DataFrame. Can also
+    be an array or list of arrays of the length of the right DataFrame.
+    These arrays are treated as if they are columns.
 left_index : boolean, default False
     Use the index from the left DataFrame as the join key(s). If it is a
     MultiIndex, the number of keys in the other DataFrame (either the index
@@ -196,6 +197,11 @@

     .. versionadded:: 0.21.0

+Notes
+-----
+Support for specifying index levels as the `on`, `left_on`, and
+`right_on` parameters was added in version 0.22.0
+
 Examples
 --------

@@ -5214,12 +5220,12 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
             Index should be similar to one of the columns in this one. If a
             Series is passed, its name attribute must be set, and that will be
             used as the column name in the resulting joined DataFrame
-        on : column name, tuple/list of column names, or array-like
-            Column(s) in the caller to join on the index in other,
-            otherwise joins index-on-index. If multiples
-            columns given, the passed DataFrame must have a MultiIndex. Can
-            pass an array as the join key if not already contained in the
-            calling DataFrame. Like an Excel VLOOKUP operation
+        on : name, tuple/list of names, or array-like
+            Column or index level name(s) in the caller to join on the index
+            in `other`, otherwise joins index-on-index. If multiple
+            values given, the `other` DataFrame must have a MultiIndex. Can
+            pass an array as the join key if it is not already contained in
+            the calling DataFrame. Like an Excel VLOOKUP operation
         how : {'left', 'right', 'outer', 'inner'}, default: 'left'
             How to handle the operation of the two objects.

@@ -5244,6 +5250,9 @@ def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
         on, lsuffix, and rsuffix options are not supported when passing a list
         of DataFrame objects

+        Support for specifying index levels as the `on` parameter was added
+        in version 0.22.0
+
         Examples
         --------
         >>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],

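The `join` docstring change above can be illustrated with a small case. This is a minimal sketch, assuming pandas 0.22.0 or later; the frames `caller` and `other` below are made up for illustration and are not the docstring's own (truncated) example. Here `on` names an index level of the caller rather than a column, and the values of that level are looked up in the index of `other`.

```python
import pandas as pd

# Hypothetical frames: 'key1' exists only as an index level of `caller`.
caller = pd.DataFrame(
    {"A": ["A0", "A1", "A2"]},
    index=pd.Index(["K0", "K1", "K2"], name="key1"),
)
other = pd.DataFrame(
    {"B": ["B0", "B1"]},
    index=pd.Index(["K0", "K1"]),
)

# Default how='left': every caller row is kept, and rows whose 'key1'
# value is missing from other's index get NaN in column 'B'.
joined = caller.join(other, on="key1")
print(joined)
```

With the default left join, the `K2` row has no match in `other`, so its `B` value is expected to be missing.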