Commit ba5483c

Dr-Irv authored and Pingviinituutti committed
Additional DOC and BUG fix related to merging with mix of columns and… (pandas-dev#20475)
1 parent d1b9134 commit ba5483c

5 files changed

+125
-67
lines changed

doc/source/merging.rst

+87 -56
@@ -31,10 +31,10 @@ operations.
3131
Concatenating objects
3232
---------------------
3333

34-
The :func:`~pandas.concat` function (in the main pandas namespace) does all of
35-
the heavy lifting of performing concatenation operations along an axis while
36-
performing optional set logic (union or intersection) of the indexes (if any) on
37-
the other axes. Note that I say "if any" because there is only a single possible
34+
The :func:`~pandas.concat` function (in the main pandas namespace) does all of
35+
the heavy lifting of performing concatenation operations along an axis while
36+
performing optional set logic (union or intersection) of the indexes (if any) on
37+
the other axes. Note that I say "if any" because there is only a single possible
3838
axis of concatenation for Series.
3939

4040
Before diving into all of the details of ``concat`` and what it can do, here is
@@ -109,9 +109,9 @@ some configurable handling of "what to do with the other axes":
109109
to the actual data concatenation.
110110
* ``copy`` : boolean, default True. If False, do not copy data unnecessarily.
111111

112-
Without a little bit of context many of these arguments don't make much sense.
113-
Let's revisit the above example. Suppose we wanted to associate specific keys
114-
with each of the pieces of the chopped up DataFrame. We can do this using the
112+
Without a little bit of context many of these arguments don't make much sense.
113+
Let's revisit the above example. Suppose we wanted to associate specific keys
114+
with each of the pieces of the chopped up DataFrame. We can do this using the
115115
``keys`` argument:
116116

117117
.. ipython:: python
@@ -138,9 +138,9 @@ It's not a stretch to see how this can be very useful. More detail on this
138138
functionality below.
139139

140140
.. note::
141-
It is worth noting that :func:`~pandas.concat` (and therefore
142-
:func:`~pandas.append`) makes a full copy of the data, and that constantly
143-
reusing this function can create a significant performance hit. If you need
141+
It is worth noting that :func:`~pandas.concat` (and therefore
142+
:func:`~pandas.append`) makes a full copy of the data, and that constantly
143+
reusing this function can create a significant performance hit. If you need
144144
to use the operation over several datasets, use a list comprehension.
145145

146146
::
@@ -224,8 +224,8 @@ DataFrame:
224224
Concatenating using ``append``
225225
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
226226

227-
A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append`
228-
instance methods on ``Series`` and ``DataFrame``. These methods actually predated
227+
A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append`
228+
instance methods on ``Series`` and ``DataFrame``. These methods actually predated
229229
``concat``. They concatenate along ``axis=0``, namely the index:
230230

231231
.. ipython:: python
@@ -271,8 +271,8 @@ need to be:
271271
272272
.. note::
273273

274-
Unlike the :py:meth:`~list.append` method, which appends to the original list
275-
and returns ``None``, :meth:`~DataFrame.append` here **does not** modify
274+
Unlike the :py:meth:`~list.append` method, which appends to the original list
275+
and returns ``None``, :meth:`~DataFrame.append` here **does not** modify
276276
``df1`` and returns its copy with ``df2`` appended.
277277

278278
.. _merging.ignore_index:
@@ -370,9 +370,9 @@ Passing ``ignore_index=True`` will drop all name references.
370370
More concatenating with group keys
371371
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
372372

373-
A fairly common use of the ``keys`` argument is to override the column names
373+
A fairly common use of the ``keys`` argument is to override the column names
374374
when creating a new ``DataFrame`` based on existing ``Series``.
375-
Notice how the default behaviour consists on letting the resulting ``DataFrame``
375+
Notice how the default behaviour consists on letting the resulting ``DataFrame``
376376
inherit the parent ``Series``' name, when these existed.
377377

378378
.. ipython:: python
@@ -468,7 +468,7 @@ Appending rows to a DataFrame
468468
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
469469

470470
While not especially efficient (since a new object must be created), you can
471-
append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
471+
append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
472472
``append``, which returns a new ``DataFrame`` as above.
473473

474474
.. ipython:: python
@@ -513,15 +513,15 @@ pandas has full-featured, **high performance** in-memory join operations
513513
idiomatically very similar to relational databases like SQL. These methods
514514
perform significantly better (in some cases well over an order of magnitude
515515
better) than other open source implementations (like ``base::merge.data.frame``
516-
in R). The reason for this is careful algorithmic design and the internal layout
516+
in R). The reason for this is careful algorithmic design and the internal layout
517517
of the data in ``DataFrame``.
518518

519519
See the :ref:`cookbook<cookbook.merge>` for some advanced strategies.
520520

521521
Users who are familiar with SQL but new to pandas might be interested in a
522522
:ref:`comparison with SQL<compare_with_sql.join>`.
523523

524-
pandas provides a single function, :func:`~pandas.merge`, as the entry point for
524+
pandas provides a single function, :func:`~pandas.merge`, as the entry point for
525525
all standard database join operations between ``DataFrame`` or named ``Series`` objects:
526526

527527
::
@@ -590,7 +590,7 @@ The return type will be the same as ``left``. If ``left`` is a ``DataFrame`` or
590590
and ``right`` is a subclass of ``DataFrame``, the return type will still be ``DataFrame``.
591591

592592
``merge`` is a function in the pandas namespace, and it is also available as a
593-
``DataFrame`` instance method :meth:`~DataFrame.merge`, with the calling
593+
``DataFrame`` instance method :meth:`~DataFrame.merge`, with the calling
594594
``DataFrame`` being implicitly considered the left object in the join.
595595

596596
The related :meth:`~DataFrame.join` method, uses ``merge`` internally for the
@@ -602,7 +602,7 @@ Brief primer on merge methods (relational algebra)
602602

603603
Experienced users of relational databases like SQL will be familiar with the
604604
terminology used to describe join operations between two SQL-table like
605-
structures (``DataFrame`` objects). There are several cases to consider which
605+
structures (``DataFrame`` objects). There are several cases to consider which
606606
are very important to understand:
607607

608608
* **one-to-one** joins: for example when joining two ``DataFrame`` objects on
@@ -642,8 +642,8 @@ key combination:
642642
labels=['left', 'right'], vertical=False);
643643
plt.close('all');
644644
645-
Here is a more complicated example with multiple join keys. Only the keys
646-
appearing in ``left`` and ``right`` are present (the intersection), since
645+
Here is a more complicated example with multiple join keys. Only the keys
646+
appearing in ``left`` and ``right`` are present (the intersection), since
647647
``how='inner'`` by default.
648648

649649
.. ipython:: python
@@ -759,13 +759,13 @@ Checking for duplicate keys
759759

760760
.. versionadded:: 0.21.0
761761

762-
Users can use the ``validate`` argument to automatically check whether there
763-
are unexpected duplicates in their merge keys. Key uniqueness is checked before
764-
merge operations and so should protect against memory overflows. Checking key
765-
uniqueness is also a good way to ensure user data structures are as expected.
762+
Users can use the ``validate`` argument to automatically check whether there
763+
are unexpected duplicates in their merge keys. Key uniqueness is checked before
764+
merge operations and so should protect against memory overflows. Checking key
765+
uniqueness is also a good way to ensure user data structures are as expected.
766766

767-
In the following example, there are duplicate values of ``B`` in the right
768-
``DataFrame``. As this is not a one-to-one merge -- as specified in the
767+
In the following example, there are duplicate values of ``B`` in the right
768+
``DataFrame``. As this is not a one-to-one merge -- as specified in the
769769
``validate`` argument -- an exception will be raised.
770770

771771

@@ -778,11 +778,11 @@ In the following example, there are duplicate values of ``B`` in the right
778778
779779
In [53]: result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
780780
...
781-
MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
781+
MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
782782
783-
If the user is aware of the duplicates in the right ``DataFrame`` but wants to
784-
ensure there are no duplicates in the left DataFrame, one can use the
785-
``validate='one_to_many'`` argument instead, which will not raise an exception.
783+
If the user is aware of the duplicates in the right ``DataFrame`` but wants to
784+
ensure there are no duplicates in the left DataFrame, one can use the
785+
``validate='one_to_many'`` argument instead, which will not raise an exception.
786786

787787
.. ipython:: python
788788
@@ -794,8 +794,8 @@ ensure there are no duplicates in the left DataFrame, one can use the
794794
The merge indicator
795795
~~~~~~~~~~~~~~~~~~~
796796

797-
:func:`~pandas.merge` accepts the argument ``indicator``. If ``True``, a
798-
Categorical-type column called ``_merge`` will be added to the output object
797+
:func:`~pandas.merge` accepts the argument ``indicator``. If ``True``, a
798+
Categorical-type column called ``_merge`` will be added to the output object
799799
that takes on values:
800800

801801
=================================== ================
@@ -903,7 +903,7 @@ Joining on index
903903
~~~~~~~~~~~~~~~~
904904

905905
:meth:`DataFrame.join` is a convenient method for combining the columns of two
906-
potentially differently-indexed ``DataFrames`` into a single result
906+
potentially differently-indexed ``DataFrames`` into a single result
907907
``DataFrame``. Here is a very basic example:
908908

909909
.. ipython:: python
@@ -983,9 +983,9 @@ indexes:
983983
Joining key columns on an index
984984
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
985985

986-
:meth:`~DataFrame.join` takes an optional ``on`` argument which may be a column
986+
:meth:`~DataFrame.join` takes an optional ``on`` argument which may be a column
987987
or multiple column names, which specifies that the passed ``DataFrame`` is to be
988-
aligned on that column in the ``DataFrame``. These two function calls are
988+
aligned on that column in the ``DataFrame``. These two function calls are
989989
completely equivalent:
990990

991991
::
@@ -995,7 +995,7 @@ completely equivalent:
995995
how='left', sort=False)
996996

997997
Obviously you can choose whichever form you find more convenient. For
998-
many-to-one joins (where one of the ``DataFrame``'s is already indexed by the
998+
many-to-one joins (where one of the ``DataFrame``'s is already indexed by the
999999
join key), using ``join`` may be more convenient. Here is a simple example:
10001000

10011001
.. ipython:: python
@@ -1133,17 +1133,42 @@ This is equivalent but less verbose and more memory efficient / faster than this
11331133
Joining with two MultiIndexes
11341134
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11351135

1136-
This is not implemented via ``join`` at-the-moment, however it can be done using
1137-
the following code.
1136+
This is supported in a limited way, provided that the index for the right
1137+
argument is completely used in the join, and is a subset of the indices in
1138+
the left argument, as in this example:
11381139

11391140
.. ipython:: python
11401141
1141-
index = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
1142-
('K1', 'X2')],
1143-
names=['key', 'X'])
1142+
leftindex = pd.MultiIndex.from_product([list('abc'), list('xy'), [1, 2]],
1143+
names=['abc', 'xy', 'num'])
1144+
left = pd.DataFrame({'v1' : range(12)}, index=leftindex)
1145+
left
1146+
1147+
rightindex = pd.MultiIndex.from_product([list('abc'), list('xy')],
1148+
names=['abc', 'xy'])
1149+
right = pd.DataFrame({'v2': [100*i for i in range(1, 7)]}, index=rightindex)
1150+
right
1151+
1152+
left.join(right, on=['abc', 'xy'], how='inner')
1153+
1154+
If that condition is not satisfied, a join with two multi-indexes can be
1155+
done using the following code.
1156+
1157+
.. ipython:: python
1158+
1159+
leftindex = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
1160+
('K1', 'X2')],
1161+
names=['key', 'X'])
11441162
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
11451163
'B': ['B0', 'B1', 'B2']},
1146-
index=index)
1164+
index=leftindex)
1165+
1166+
rightindex = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
1167+
('K2', 'Y2'), ('K2', 'Y3')],
1168+
names=['key', 'Y'])
1169+
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
1170+
'D': ['D0', 'D1', 'D2', 'D3']},
1171+
index=rightindex)
11471172
11481173
result = pd.merge(left.reset_index(), right.reset_index(),
11491174
on=['key'], how='inner').set_index(['key','X','Y'])
@@ -1161,7 +1186,7 @@ the following code.
11611186
Merging on a combination of columns and index levels
11621187
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11631188

1164-
.. versionadded:: 0.22
1189+
.. versionadded:: 0.23
11651190

11661191
Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
11671192
may refer to either column names or index level names. This enables merging
@@ -1200,6 +1225,12 @@ resetting indexes.
12001225
frames, the index level is preserved as an index level in the resulting
12011226
DataFrame.
12021227

1228+
.. note::
1229+
When DataFrames are merged using only some of the levels of a `MultiIndex`,
1230+
the extra levels will be dropped from the resulting merge. In order to
1231+
preserve those levels, use ``reset_index`` on those level names to move
1232+
those levels to columns prior to doing the merge.
1233+
12031234
.. note::
12041235

12051236
If a string matches both a column name and an index level name, then a
@@ -1262,7 +1293,7 @@ similarly.
12621293
Joining multiple DataFrame or Panel objects
12631294
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12641295

1265-
A list or tuple of ``DataFrames`` can also be passed to :meth:`~DataFrame.join`
1296+
A list or tuple of ``DataFrames`` can also be passed to :meth:`~DataFrame.join`
12661297
to join them together on their indexes.
12671298

12681299
.. ipython:: python
@@ -1284,7 +1315,7 @@ Merging together values within Series or DataFrame columns
12841315
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12851316

12861317
Another fairly common situation is to have two like-indexed (or similarly
1287-
indexed) ``Series`` or ``DataFrame`` objects and wanting to "patch" values in
1318+
indexed) ``Series`` or ``DataFrame`` objects and wanting to "patch" values in
12881319
one object from values for matching indices in the other. Here is an example:
12891320

12901321
.. ipython:: python
@@ -1309,7 +1340,7 @@ For this, use the :meth:`~DataFrame.combine_first` method:
13091340
plt.close('all');
13101341
13111342
Note that this method only takes values from the right ``DataFrame`` if they are
1312-
missing in the left ``DataFrame``. A related method, :meth:`~DataFrame.update`,
1343+
missing in the left ``DataFrame``. A related method, :meth:`~DataFrame.update`,
13131344
alters non-NA values in place:
13141345

13151346
.. ipython:: python
@@ -1361,15 +1392,15 @@ Merging AsOf
13611392

13621393
.. versionadded:: 0.19.0
13631394

1364-
A :func:`merge_asof` is similar to an ordered left-join except that we match on
1365-
nearest key rather than equal keys. For each row in the ``left`` ``DataFrame``,
1366-
we select the last row in the ``right`` ``DataFrame`` whose ``on`` key is less
1395+
A :func:`merge_asof` is similar to an ordered left-join except that we match on
1396+
nearest key rather than equal keys. For each row in the ``left`` ``DataFrame``,
1397+
we select the last row in the ``right`` ``DataFrame`` whose ``on`` key is less
13671398
than the left's key. Both DataFrames must be sorted by the key.
13681399

1369-
Optionally an asof merge can perform a group-wise merge. This matches the
1400+
Optionally an asof merge can perform a group-wise merge. This matches the
13701401
``by`` key equally, in addition to the nearest match on the ``on`` key.
13711402

1372-
For example; we might have ``trades`` and ``quotes`` and we want to ``asof``
1403+
For example; we might have ``trades`` and ``quotes`` and we want to ``asof``
13731404
merge them.
13741405

13751406
.. ipython:: python
@@ -1428,8 +1459,8 @@ We only asof within ``2ms`` between the quote time and the trade time.
14281459
by='ticker',
14291460
tolerance=pd.Timedelta('2ms'))
14301461
1431-
We only asof within ``10ms`` between the quote time and the trade time and we
1432-
exclude exact matches on time. Note that though we exclude the exact matches
1462+
We only asof within ``10ms`` between the quote time and the trade time and we
1463+
exclude exact matches on time. Note that though we exclude the exact matches
14331464
(of the quotes), prior quotes **do** propagate to that point in time.
14341465

14351466
.. ipython:: python
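
The note this commit adds to merging.rst (extra ``MultiIndex`` levels are dropped when merging on only some of the levels) can be illustrated with a minimal sketch. The frames reuse the shapes from the documentation example in the diff; the variable names `dropped` and `kept` are illustrative only, and the exact layout of the result may differ between pandas versions.

```python
import pandas as pd

# Left frame indexed by three levels; right frame by the first two only.
leftindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
)
left = pd.DataFrame({"v1": range(12)}, index=leftindex)

rightindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy")], names=["abc", "xy"]
)
right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

# Merging on only two of the three left index levels: the unused
# "num" level is not carried into the result.
dropped = pd.merge(left, right, on=["abc", "xy"])

# Moving "num" to a column first keeps it in the merged result.
kept = pd.merge(left.reset_index("num"), right, on=["abc", "xy"])

print(dropped.index.names, list(dropped.columns))
print(kept.index.names, list(kept.columns))
```

Resetting only the level that would otherwise be lost keeps the remaining levels available as merge keys.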

doc/source/whatsnew/v0.24.0.rst

+1
@@ -1545,6 +1545,7 @@ Reshaping
15451545
- Bug in :meth:`DataFrame.append` with a :class:`Series` with a dateutil timezone would raise a ``TypeError`` (:issue:`23682`)
15461546
- Bug in ``Series`` construction when passing no data and ``dtype=str`` (:issue:`22477`)
15471547
- Bug in :func:`cut` with ``bins`` as an overlapping ``IntervalIndex`` where multiple bins were returned per item instead of raising a ``ValueError`` (:issue:`23980`)
1548+
- Bug in :meth:`DataFrame.join` when joining on partial MultiIndex would drop names (:issue:`20452`).
15481549

15491550
.. _whatsnew_0240.bug_fixes.sparse:
15501551

pandas/core/reshape/merge.py

+1
@@ -715,6 +715,7 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
715715
result[name] = key_col
716716
elif result._is_level_reference(name):
717717
if isinstance(result.index, MultiIndex):
718+
key_col.name = name
718719
idx_list = [result.index.get_level_values(level_name)
719720
if level_name != name else key_col
720721
for level_name in result.index.names]
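
The added `key_col.name = name` line presumably matters because `MultiIndex.from_arrays` takes level names from the `name` attribute of each input when `names` is not passed. The sketch below is a simplified illustration with made-up data, not the pandas internals; the `key_col` here merely mimics an unnamed key column.

```python
import pandas as pd

# An existing level that already carries its name.
abc = pd.Index(["a", "a", "b"], name="abc")

# A rebuilt key column with no name, standing in for `key_col`.
key_col = pd.Index(["x", "y", "x"])

# Without a name on key_col, the second level ends up unnamed.
unnamed = pd.MultiIndex.from_arrays([abc, key_col])

# Assigning the name first preserves it in the rebuilt MultiIndex.
key_col.name = "xy"
named = pd.MultiIndex.from_arrays([abc, key_col])

print(list(unnamed.names))
print(list(named.names))
```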

pandas/tests/reshape/merge/test_join.py

+25
@@ -730,6 +730,31 @@ def test_panel_join_many(self):
730730
pytest.raises(ValueError, panels[0].join, panels[1:],
731731
how='right')
732732

733+
def test_join_multi_to_multi(self, join_type):
734+
# GH 20475
735+
leftindex = MultiIndex.from_product([list('abc'), list('xy'), [1, 2]],
736+
names=['abc', 'xy', 'num'])
737+
left = DataFrame({'v1': range(12)}, index=leftindex)
738+
739+
rightindex = MultiIndex.from_product([list('abc'), list('xy')],
740+
names=['abc', 'xy'])
741+
right = DataFrame({'v2': [100 * i for i in range(1, 7)]},
742+
index=rightindex)
743+
744+
result = left.join(right, on=['abc', 'xy'], how=join_type)
745+
expected = (left.reset_index()
746+
.merge(right.reset_index(),
747+
on=['abc', 'xy'], how=join_type)
748+
.set_index(['abc', 'xy', 'num'])
749+
)
750+
assert_frame_equal(expected, result)
751+
752+
with pytest.raises(ValueError):
753+
left.join(right, on='xy', how=join_type)
754+
755+
with pytest.raises(ValueError):
756+
right.join(left, on=['abc', 'xy'], how=join_type)
757+
733758

734759
def _check_join(left, right, result, join_col, how='left',
735760
lsuffix='_x', rsuffix='_y'):
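
The restrictions exercised by `test_join_multi_to_multi` can also be sketched outside the pandas test suite. The frames are the ones built in the test; the helper `raises_value_error` is a hypothetical convenience function, not part of pandas or pytest.

```python
import pandas as pd

leftindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
)
left = pd.DataFrame({"v1": range(12)}, index=leftindex)

rightindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy")], names=["abc", "xy"]
)
right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)


def raises_value_error(func):
    """Return True if calling func() raises ValueError."""
    try:
        func()
    except ValueError:
        return True
    return False


# Supported: every level of right's index is used as a join key.
joined = left.join(right, on=["abc", "xy"], how="inner")
print(list(joined.index.names))

# Not supported: only part of right's index is used.
partial = raises_value_error(lambda: left.join(right, on="xy"))

# Not supported: with the roles swapped, the two join keys do not
# cover all three levels of the other frame's index.
swapped = raises_value_error(lambda: right.join(left, on=["abc", "xy"]))
print(partial, swapped)
```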
