Commit fa4dbac

harisbal authored and committed
Allow for join between two multi-index dataframe instances
1 parent c857b4f commit fa4dbac

5 files changed, +357 -91 lines changed

doc/source/merging.rst

+81 -48
@@ -31,10 +31,17 @@ operations.
 Concatenating objects
 ---------------------
 
+<<<<<<< HEAD
+The :func:`~pandas.concat` function (in the main pandas namespace) does all of
+the heavy lifting of performing concatenation operations along an axis while
+performing optional set logic (union or intersection) of the indexes (if any) on
+the other axes. Note that I say "if any" because there is only a single possible
+=======
 The :func:`~pandas.concat` function (in the main pandas namespace) does all of
 the heavy lifting of performing concatenation operations along an axis while
 performing optional set logic (union or intersection) of the indexes (if any) on
 the other axes. Note that I say "if any" because there is only a single possible
+>>>>>>> remotes/upstream/master
 axis of concatenation for Series.
 
 Before diving into all of the details of ``concat`` and what it can do, here is
@@ -109,9 +116,9 @@ some configurable handling of "what to do with the other axes":
   to the actual data concatenation.
 - ``copy`` : boolean, default True. If False, do not copy data unnecessarily.
 
-Without a little bit of context many of these arguments don't make much sense.
-Let's revisit the above example. Suppose we wanted to associate specific keys
-with each of the pieces of the chopped up DataFrame. We can do this using the
+Without a little bit of context many of these arguments don't make much sense.
+Let's revisit the above example. Suppose we wanted to associate specific keys
+with each of the pieces of the chopped up DataFrame. We can do this using the
 ``keys`` argument:
 
 .. ipython:: python
@@ -138,9 +145,9 @@ It's not a stretch to see how this can be very useful. More detail on this
 functionality below.
 
 .. note::
-   It is worth noting that :func:`~pandas.concat` (and therefore
-   :func:`~pandas.append`) makes a full copy of the data, and that constantly
-   reusing this function can create a significant performance hit. If you need
+   It is worth noting that :func:`~pandas.concat` (and therefore
+   :func:`~pandas.append`) makes a full copy of the data, and that constantly
+   reusing this function can create a significant performance hit. If you need
    to use the operation over several datasets, use a list comprehension.
 
 ::
@@ -153,7 +160,7 @@ Set logic on the other axes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 When gluing together multiple DataFrames, you have a choice of how to handle
-the other axes (other than the one being concatenated). This can be done in
+the other axes (other than the one being concatenated). This can be done in
 the following three ways:
 
 - Take the (sorted) union of them all, ``join='outer'``. This is the default
@@ -216,8 +223,8 @@ DataFrame:
 Concatenating using ``append``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append`
-instance methods on ``Series`` and ``DataFrame``. These methods actually predated
+A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append`
+instance methods on ``Series`` and ``DataFrame``. These methods actually predated
 ``concat``. They concatenate along ``axis=0``, namely the index:
 
 .. ipython:: python
@@ -263,8 +270,8 @@ need to be:
 
 .. note::
 
-   Unlike the :py:meth:`~list.append` method, which appends to the original list
-   and returns ``None``, :meth:`~DataFrame.append` here **does not** modify
+   Unlike the :py:meth:`~list.append` method, which appends to the original list
+   and returns ``None``, :meth:`~DataFrame.append` here **does not** modify
    ``df1`` and returns its copy with ``df2`` appended.
 
 .. _merging.ignore_index:
@@ -362,9 +369,9 @@ Passing ``ignore_index=True`` will drop all name references.
 More concatenating with group keys
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-A fairly common use of the ``keys`` argument is to override the column names
+A fairly common use of the ``keys`` argument is to override the column names
 when creating a new ``DataFrame`` based on existing ``Series``.
-Notice how the default behaviour consists on letting the resulting ``DataFrame``
+Notice how the default behaviour consists on letting the resulting ``DataFrame``
 inherit the parent ``Series``' name, when these existed.
 
 .. ipython:: python
@@ -460,7 +467,7 @@ Appending rows to a DataFrame
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 While not especially efficient (since a new object must be created), you can
-append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
+append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
 ``append``, which returns a new ``DataFrame`` as above.
 
 .. ipython:: python
@@ -505,15 +512,15 @@ pandas has full-featured, **high performance** in-memory join operations
 idiomatically very similar to relational databases like SQL. These methods
 perform significantly better (in some cases well over an order of magnitude
 better) than other open source implementations (like ``base::merge.data.frame``
-in R). The reason for this is careful algorithmic design and the internal layout
+in R). The reason for this is careful algorithmic design and the internal layout
 of the data in ``DataFrame``.
 
 See the :ref:`cookbook<cookbook.merge>` for some advanced strategies.
 
 Users who are familiar with SQL but new to pandas might be interested in a
 :ref:`comparison with SQL<compare_with_sql.join>`.
 
-pandas provides a single function, :func:`~pandas.merge`, as the entry point for
+pandas provides a single function, :func:`~pandas.merge`, as the entry point for
 all standard database join operations between ``DataFrame`` objects:
 
 ::
@@ -582,7 +589,11 @@ and ``right`` is a subclass of DataFrame, the return type will still be
 ``DataFrame``.
 
 ``merge`` is a function in the pandas namespace, and it is also available as a
+<<<<<<< HEAD
+``DataFrame`` instance method :meth:`~DataFrame.merge`, with the calling
+=======
 ``DataFrame`` instance method :meth:`~DataFrame.merge`, with the calling
+>>>>>>> remotes/upstream/master
 ``DataFrame `` being implicitly considered the left object in the join.
 
 The related :meth:`~DataFrame.join` method, uses ``merge`` internally for the
@@ -594,7 +605,7 @@ Brief primer on merge methods (relational algebra)
 
 Experienced users of relational databases like SQL will be familiar with the
 terminology used to describe join operations between two SQL-table like
-structures (``DataFrame`` objects). There are several cases to consider which
+structures (``DataFrame`` objects). There are several cases to consider which
 are very important to understand:
 
 - **one-to-one** joins: for example when joining two ``DataFrame`` objects on
@@ -634,8 +645,8 @@ key combination:
           labels=['left', 'right'], vertical=False);
    plt.close('all');
 
-Here is a more complicated example with multiple join keys. Only the keys
-appearing in ``left`` and ``right`` are present (the intersection), since
+Here is a more complicated example with multiple join keys. Only the keys
+appearing in ``left`` and ``right`` are present (the intersection), since
 ``how='inner'`` by default.
 
 .. ipython:: python
@@ -751,13 +762,13 @@ Checking for duplicate keys
 
 .. versionadded:: 0.21.0
 
-Users can use the ``validate`` argument to automatically check whether there
-are unexpected duplicates in their merge keys. Key uniqueness is checked before
-merge operations and so should protect against memory overflows. Checking key
-uniqueness is also a good way to ensure user data structures are as expected.
+Users can use the ``validate`` argument to automatically check whether there
+are unexpected duplicates in their merge keys. Key uniqueness is checked before
+merge operations and so should protect against memory overflows. Checking key
+uniqueness is also a good way to ensure user data structures are as expected.
 
-In the following example, there are duplicate values of ``B`` in the right
-``DataFrame``. As this is not a one-to-one merge -- as specified in the
+In the following example, there are duplicate values of ``B`` in the right
+``DataFrame``. As this is not a one-to-one merge -- as specified in the
 ``validate`` argument -- an exception will be raised.
 
 
@@ -770,11 +781,11 @@ In the following example, there are duplicate values of ``B`` in the right
 
   In [53]: result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
   ...
-  MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
+  MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
 
-If the user is aware of the duplicates in the right ``DataFrame`` but wants to
-ensure there are no duplicates in the left DataFrame, one can use the
-``validate='one_to_many'`` argument instead, which will not raise an exception.
+If the user is aware of the duplicates in the right ``DataFrame`` but wants to
+ensure there are no duplicates in the left DataFrame, one can use the
+``validate='one_to_many'`` argument instead, which will not raise an exception.
 
 .. ipython:: python
 
@@ -786,8 +797,8 @@ ensure there are no duplicates in the left DataFrame, one can use the
 The merge indicator
 ~~~~~~~~~~~~~~~~~~~
 
-:func:`~pandas.merge` accepts the argument ``indicator``. If ``True``, a
-Categorical-type column called ``_merge`` will be added to the output object
+:func:`~pandas.merge` accepts the argument ``indicator``. If ``True``, a
+Categorical-type column called ``_merge`` will be added to the output object
 that takes on values:
 
 =================================== ================
@@ -895,7 +906,7 @@ Joining on index
 ~~~~~~~~~~~~~~~~
 
 :meth:`DataFrame.join` is a convenient method for combining the columns of two
-potentially differently-indexed ``DataFrames`` into a single result
+potentially differently-indexed ``DataFrames`` into a single result
 ``DataFrame``. Here is a very basic example:
 
 .. ipython:: python
@@ -975,9 +986,15 @@ indexes:
 Joining key columns on an index
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+<<<<<<< HEAD
+:meth:`~DataFrame.join` takes an optional ``on`` argument which may be a column
+or multiple column names, which specifies that the passed ``DataFrame`` is to be
+aligned on that column in the ``DataFrame``. These two function calls are
+=======
 :meth:`~DataFrame.join` takes an optional ``on`` argument which may be a column
 or multiple column names, which specifies that the passed ``DataFrame`` is to be
 aligned on that column in the ``DataFrame``. These two function calls are
+>>>>>>> remotes/upstream/master
 completely equivalent:
 
 ::
@@ -987,7 +1004,11 @@ completely equivalent:
           how='left', sort=False)
 
 Obviously you can choose whichever form you find more convenient. For
+<<<<<<< HEAD
+many-to-one joins (where one of the ``DataFrame``'s is already indexed by the
+=======
 many-to-one joins (where one of the ``DataFrame``'s is already indexed by the
+>>>>>>> remotes/upstream/master
 join key), using ``join`` may be more convenient. Here is a simple example:
 
 .. ipython:: python
@@ -1125,20 +1146,25 @@ This is equivalent but less verbose and more memory efficient / faster than this
 Joining with two multi-indexes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-This is not implemented via ``join`` at-the-moment, however it can be done using
-the following code.
+As of Pandas 0.23.1, :meth:`DataFrame.join` can be used to join multi-indexed ``DataFrame`` instances on the overlapping index levels.
 
 .. ipython:: python
 
-   index = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
+   index_left = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
                                       ('K1', 'X2')],
                                       names=['key', 'X'])
    left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                         'B': ['B0', 'B1', 'B2']},
-                        index=index)
+                        index=index_left)
+
+   index_right = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
+                                            ('K2', 'Y2'), ('K2', 'Y3')],
+                                            names=['key', 'Y'])
+   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
+                         'D': ['D0', 'D1', 'D2', 'D3']},
+                         index=index_right)
 
-   result = pd.merge(left.reset_index(), right.reset_index(),
-                     on=['key'], how='inner').set_index(['key','X','Y'])
+   left.join(right)
 
 .. ipython:: python
    :suppress:
@@ -1148,6 +1174,13 @@ the following code.
           labels=['left', 'right'], vertical=False);
    plt.close('all');
 
+For earlier versions it can be done using the following.
+
+.. ipython:: python
+
+   pd.merge(left.reset_index(), right.reset_index(),
+            on=['key'], how='inner').set_index(['key','X','Y'])
+
 .. _merging.merge_on_columns_and_levels:
 
 Merging on a combination of columns and index levels
@@ -1254,7 +1287,7 @@ similarly.
 Joining multiple DataFrame or Panel objects
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-A list or tuple of ``DataFrames`` can also be passed to :meth:`~DataFrame.join`
+A list or tuple of ``DataFrames`` can also be passed to :meth:`~DataFrame.join`
 to join them together on their indexes.
 
 .. ipython:: python
@@ -1276,7 +1309,7 @@ Merging together values within Series or DataFrame columns
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Another fairly common situation is to have two like-indexed (or similarly
-indexed) ``Series`` or ``DataFrame`` objects and wanting to "patch" values in
+indexed) ``Series`` or ``DataFrame`` objects and wanting to "patch" values in
 one object from values for matching indices in the other. Here is an example:
 
 .. ipython:: python
@@ -1301,7 +1334,7 @@ For this, use the :meth:`~DataFrame.combine_first` method:
    plt.close('all');
 
 Note that this method only takes values from the right ``DataFrame`` if they are
-missing in the left ``DataFrame``. A related method, :meth:`~DataFrame.update`,
+missing in the left ``DataFrame``. A related method, :meth:`~DataFrame.update`,
 alters non-NA values inplace:
 
 .. ipython:: python
@@ -1353,15 +1386,15 @@ Merging AsOf
 
 .. versionadded:: 0.19.0
 
-A :func:`merge_asof` is similar to an ordered left-join except that we match on
-nearest key rather than equal keys. For each row in the ``left`` ``DataFrame``,
-we select the last row in the ``right`` ``DataFrame`` whose ``on`` key is less
+A :func:`merge_asof` is similar to an ordered left-join except that we match on
+nearest key rather than equal keys. For each row in the ``left`` ``DataFrame``,
+we select the last row in the ``right`` ``DataFrame`` whose ``on`` key is less
 than the left's key. Both DataFrames must be sorted by the key.
 
-Optionally an asof merge can perform a group-wise merge. This matches the
+Optionally an asof merge can perform a group-wise merge. This matches the
 ``by`` key equally, in addition to the nearest match on the ``on`` key.
 
-For example; we might have ``trades`` and ``quotes`` and we want to ``asof``
+For example; we might have ``trades`` and ``quotes`` and we want to ``asof``
 merge them.
 
 .. ipython:: python
@@ -1420,8 +1453,8 @@ We only asof within ``2ms`` between the quote time and the trade time.
                  by='ticker',
                  tolerance=pd.Timedelta('2ms'))
 
-We only asof within ``10ms`` between the quote time and the trade time and we
-exclude exact matches on time. Note that though we exclude the exact matches
+We only asof within ``10ms`` between the quote time and the trade time and we
+exclude exact matches on time. Note that though we exclude the exact matches
 (of the quotes), prior quotes **do** propagate to that point in time.
 
 .. ipython:: python
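
The behaviour documented in the "Joining with two multi-indexes" hunk above can be tried outside the docs build with a minimal standalone sketch. It assumes an installed pandas version in which ``DataFrame.join`` accepts two multi-indexed frames that share an index level. The names ``joined`` and ``fallback`` are illustrative and not part of the commit, and ``how='inner'`` is passed explicitly so that both routes use the same join type (the ``left.join(right)`` call in the diff relies on the default left join).

```python
import pandas as pd

# Same sample data as in the diff: both frames share the index level 'key'.
index_left = pd.MultiIndex.from_tuples(
    [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
)
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=index_left
)

index_right = pd.MultiIndex.from_tuples(
    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
    index=index_right,
)

# Route 1: join directly on the overlapping index level ('key').
joined = left.join(right, how="inner")

# Route 2: the reset_index / merge / set_index workaround from the diff.
fallback = pd.merge(
    left.reset_index(), right.reset_index(), on=["key"], how="inner"
).set_index(["key", "X", "Y"])

print(joined)
print(fallback)
```

Both routes should yield the same rows, indexed by ``key``, ``X`` and ``Y``; only keys present in both frames survive the inner join.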

doc/source/whatsnew/v0.23.0.txt

+34
@@ -59,6 +59,40 @@ The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtyp
    pd.get_dummies(df, columns=['c']).dtypes
    pd.get_dummies(df, columns=['c'], dtype=bool).dtypes
 
+.. _whatsnew_0230.enhancements.join_with_two_multiindexes:
+
+Joining with two multi-indexes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As of Pandas 0.23.1, :meth:`DataFrame.join` can be used to join multi-indexed ``DataFrame`` instances on the overlapping index levels.
+
+See the :ref:`Merge, join, and concatenate
+<merging.Join_with_two_multi_indexes>` documentation section.
+
+.. ipython:: python
+
+   index_left = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
+                                           ('K1', 'X2')],
+                                           names=['key', 'X'])
+   left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
+                        'B': ['B0', 'B1', 'B2']},
+                        index=index_left)
+
+   index_right = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
+                                            ('K2', 'Y2'), ('K2', 'Y3')],
+                                            names=['key', 'Y'])
+   right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
+                         'D': ['D0', 'D1', 'D2', 'D3']},
+                         index=index_right)
+
+   left.join(right)
+
+For earlier versions it can be done using the following.
+
+.. ipython:: python
+
+   pd.merge(left.reset_index(), right.reset_index(),
+            on=['key'], how='inner').set_index(['key','X','Y'])
 
 .. _whatsnew_0230.enhancements.merge_on_columns_and_levels:
 