Commit ab27073

Merge pull request #6363 from jreback/mi_merge_single
ENH/BUG: allow single versus multi-index joining on inferred level (GH3662)
2 parents 7e8d90b + dd79e55 commit ab27073

5 files changed

+247
-7
lines changed

doc/source/merging.rst

+75-1
Original file line numberDiff line numberDiff line change
@@ -307,7 +307,7 @@ the data in DataFrame.
307307

308308
See the :ref:`cookbook<cookbook.merge>` for some advanced strategies.
309309

310-
Users who are familiar with SQL but new to pandas might be interested in a
310+
Users who are familiar with SQL but new to pandas might be interested in a
311311
:ref:`comparison with SQL<compare_with_sql.join>`.
312312

313313
pandas provides a single function, ``merge``, as the entry point for all
@@ -610,3 +610,77 @@ values inplace:
610610
611611
df1.update(df2)
612612
df1
613+
614+
.. _merging.on_mi:
615+
616+
Merging with Multi-indexes
617+
--------------------------
618+
619+
.. _merging.join_on_mi:
620+
621+
Joining a single Index to a Multi-index
622+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
623+
624+
.. versionadded:: 0.14.0
625+
626+
You can join a singly-indexed DataFrame with a level of a multi-indexed DataFrame.
627+
The join level is inferred by name: the index name of the singly-indexed frame must match
628+
a level name of the multi-indexed frame.
629+
630+
.. ipython:: python
631+
632+
household = DataFrame(dict(household_id = [1,2,3],
633+
male = [0,1,0],
634+
wealth = [196087.3,316478.7,294750]),
635+
columns = ['household_id','male','wealth']
636+
).set_index('household_id')
637+
household
638+
portfolio = DataFrame(dict(household_id = [1,2,2,3,3,3,4],
639+
asset_id = ["nl0000301109","nl0000289783","gb00b03mlx29",
640+
"gb00b03mlx29","lu0197800237","nl0000289965",np.nan],
641+
name = ["ABN Amro","Robeco","Royal Dutch Shell","Royal Dutch Shell",
642+
"AAB Eastern Europe Equity Fund","Postbank BioTech Fonds",np.nan],
643+
share = [1.0,0.4,0.6,0.15,0.6,0.25,1.0]),
644+
columns = ['household_id','asset_id','name','share']
645+
).set_index(['household_id','asset_id'])
646+
portfolio
647+
648+
household.join(portfolio, how='inner')
649+
650+
This is equivalent to, but less verbose, more memory-efficient and faster than, the following.
651+
652+
.. code-block:: python
653+
654+
merge(household.reset_index(),
655+
portfolio.reset_index(),
656+
on=['household_id'],
657+
how='inner'
658+
).set_index(['household_id','asset_id'])
659+
660+
Joining with two multi-indexes
661+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
662+
663+
This is not yet implemented via ``join``; however, it can be done using the following.
664+
665+
.. ipython:: python
666+
667+
household = DataFrame(dict(household_id = [1,2,2,3,3,3,4],
668+
asset_id = ["nl0000301109","nl0000301109","gb00b03mlx29",
669+
"gb00b03mlx29","lu0197800237","nl0000289965",np.nan],
670+
share = [1.0,0.4,0.6,0.15,0.6,0.25,1.0]),
671+
columns = ['household_id','asset_id','share']
672+
).set_index(['household_id','asset_id'])
673+
household
674+
675+
log_return = DataFrame(dict(asset_id = ["gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29",
676+
"lu0197800237", "lu0197800237"],
677+
t = [233, 234, 235, 180, 181],
678+
log_return = [.09604978, -.06524096, .03532373, .03025441, .036997]),
679+
).set_index(["asset_id","t"])
680+
log_return
681+
682+
merge(household.reset_index(),
683+
log_return.reset_index(),
684+
on=['asset_id'],
685+
how='inner'
686+
).set_index(['household_id','asset_id','t'])
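
For readers on a current pandas release, the single-index-to-multi-index join documented in this file can be sketched as follows. This is a minimal illustration rather than part of the commit: it assumes the modern ``import pandas as pd`` namespace instead of the bare ``DataFrame`` and ``merge`` names used in the diff, and it leaves out the household 4 row with a missing ``asset_id`` to keep the data small.

```python
import pandas as pd

# Singly-indexed frame: one row per household.
household = pd.DataFrame(
    {
        "household_id": [1, 2, 3],
        "male": [0, 1, 0],
        "wealth": [196087.3, 316478.7, 294750.0],
    }
).set_index("household_id")

# Multi-indexed frame: one row per (household, asset) pair.
portfolio = pd.DataFrame(
    {
        "household_id": [1, 2, 2, 3, 3, 3],
        "asset_id": [
            "nl0000301109", "nl0000289783", "gb00b03mlx29",
            "gb00b03mlx29", "lu0197800237", "nl0000289965",
        ],
        "share": [1.0, 0.4, 0.6, 0.15, 0.6, 0.25],
    }
).set_index(["household_id", "asset_id"])

# The join level is inferred from the shared index name "household_id".
joined = household.join(portfolio, how="inner")

# The longer spelling: flatten both indexes, merge on the column, re-index.
merged = pd.merge(
    household.reset_index(),
    portfolio.reset_index(),
    on=["household_id"],
    how="inner",
).set_index(["household_id", "asset_id"])

print(joined)
```

Both frames carry the household columns repeated once per asset held, which is why the inner join has as many rows as ``portfolio`` has matching pairs.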

doc/source/release.rst

+1
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,7 @@ Improvements to existing features
7878
(:issue:`6014`)
7979
- Allow multi-index slicers (:issue:`6134`, :issue:`4036`, :issue:`3057`, :issue:`2598`, :issue:`5641`)
8080
- improve performance of slice indexing on Series with string keys (:issue:`6341`)
81+
- implement joining a single-level indexed DataFrame on a matching level of a multi-indexed DataFrame (:issue:`3662`)
8182

8283
.. _release.bug_fixes-0.14.0:
8384

doc/source/v0.14.0.txt

+25-2
Original file line numberDiff line numberDiff line change
@@ -9,8 +9,8 @@ users upgrade to this version.
99

1010
Highlights include:
1111

12-
-
13-
12+
- MultiIndexing Using Slicers
13+
- Joining a singly-indexed DataFrame with a multi-indexed DataFrame
1414

1515
API changes
1616
~~~~~~~~~~~
@@ -155,6 +155,29 @@ Enhancements
155155
most plot kinds. (:issue:`6014`)
156156
- improve performance of slice indexing on Series with string keys (:issue:`6341`)
157157
- Hexagonal bin plots from ``DataFrame.plot`` with ``kind='hexbin'`` (:issue:`5478`)
158+
- Joining a singly-indexed DataFrame with a multi-indexed DataFrame (:issue:`3662`)
159+
160+
See :ref:`the docs<merging.join_on_mi>`. Joining multi-index DataFrames on both the left and right is not yet supported.
161+
162+
.. ipython:: python
163+
164+
household = DataFrame(dict(household_id = [1,2,3],
165+
male = [0,1,0],
166+
wealth = [196087.3,316478.7,294750]),
167+
columns = ['household_id','male','wealth']
168+
).set_index('household_id')
169+
household
170+
portfolio = DataFrame(dict(household_id = [1,2,2,3,3,3,4],
171+
asset_id = ["nl0000301109","nl0000289783","gb00b03mlx29",
172+
"gb00b03mlx29","lu0197800237","nl0000289965",np.nan],
173+
name = ["ABN Amro","Robeco","Royal Dutch Shell","Royal Dutch Shell",
174+
"AAB Eastern Europe Equity Fund","Postbank BioTech Fonds",np.nan],
175+
share = [1.0,0.4,0.6,0.15,0.6,0.25,1.0]),
176+
columns = ['household_id','asset_id','name','share']
177+
).set_index(['household_id','asset_id'])
178+
portfolio
179+
180+
household.join(portfolio, how='inner')
158181

159182
Performance
160183
~~~~~~~~~~~

pandas/core/index.py

+52-2
Original file line numberDiff line numberDiff line change
@@ -1265,8 +1265,21 @@ def join(self, other, how='left', level=None, return_indexers=False):
12651265
-------
12661266
join_index, (left_indexer, right_indexer)
12671267
"""
1268-
if (level is not None and (isinstance(self, MultiIndex) or
1269-
isinstance(other, MultiIndex))):
1268+
self_is_mi = isinstance(self, MultiIndex)
1269+
other_is_mi = isinstance(other, MultiIndex)
1270+
1271+
# try to figure out the join level
1272+
# GH3662
1273+
if (level is None and (self_is_mi or other_is_mi)):
1274+
1275+
# have the same levels/names so a simple join
1276+
if self.names == other.names:
1277+
pass
1278+
else:
1279+
return self._join_multi(other, how=how, return_indexers=return_indexers)
1280+
1281+
# join on the level
1282+
if (level is not None and (self_is_mi or other_is_mi)):
12701283
return self._join_level(other, level, how=how,
12711284
return_indexers=return_indexers)
12721285

@@ -1344,6 +1357,43 @@ def join(self, other, how='left', level=None, return_indexers=False):
13441357
else:
13451358
return join_index
13461359

1360+
def _join_multi(self, other, how, return_indexers=True):
1361+
1362+
self_is_mi = isinstance(self, MultiIndex)
1363+
other_is_mi = isinstance(other, MultiIndex)
1364+
1365+
# figure out join names
1366+
self_names = [ n for n in self.names if n is not None ]
1367+
other_names = [ n for n in other.names if n is not None ]
1368+
overlap = list(set(self_names) & set(other_names))
1369+
1370+
# need at least 1 in common, but not more than 1
1371+
if not len(overlap):
1372+
raise ValueError("cannot join with no level specified and no overlapping names")
1373+
if len(overlap) > 1:
1374+
raise NotImplementedError("merging with more than one level overlap on a multi-index is not implemented")
1375+
jl = overlap[0]
1376+
1377+
# make the indices into mi's that match
1378+
if not (self_is_mi and other_is_mi):
1379+
1380+
flip_order = False
1381+
if self_is_mi:
1382+
self, other = other, self
1383+
flip_order = True
1384+
1385+
level = other.names.index(jl)
1386+
result = self._join_level(other, level, how=how,
1387+
return_indexers=return_indexers)
1388+
1389+
if flip_order:
1390+
if isinstance(result, tuple):
1391+
return result[0], result[2], result[1]
1392+
return result
1393+
1394+
# 2 multi-indexes
1395+
raise NotImplementedError("merging with both multi-indexes is not implemented")
1396+
13471397
def _join_non_unique(self, other, how='left', return_indexers=False):
13481398
from pandas.tools.merge import _get_join_indexers
13491399
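
The name-matching rule that ``_join_multi`` applies in the hunk above can be isolated as a small standalone function. The sketch below is only an illustration under stated assumptions: ``infer_join_level`` is a hypothetical helper name, not a pandas API, and it works on plain lists of names rather than on ``Index`` objects. It mirrors the overlap check only; the real method additionally raises ``NotImplementedError`` when both sides are multi-indexes.

```python
def infer_join_level(left_names, right_names):
    """Return the one index-level name shared by both sides.

    Hypothetical helper that mirrors the overlap check in ``_join_multi``.
    """
    # Unnamed levels (None) can never be used for matching.
    left = [n for n in left_names if n is not None]
    right = [n for n in right_names if n is not None]
    overlap = list(set(left) & set(right))

    # Exactly one shared name is required.
    if not overlap:
        raise ValueError(
            "cannot join with no level specified and no overlapping names"
        )
    if len(overlap) > 1:
        raise NotImplementedError(
            "more than one overlapping level name is not supported"
        )
    return overlap[0]


# A single index named "household_id" against a two-level index.
print(infer_join_level(["household_id"], ["household_id", "asset_id"]))
```

Renaming the single index to something that appears in neither list (for example ``foo``) makes the function raise ``ValueError``, which matches the first invalid case exercised in ``test_join_multi_levels``.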

pandas/tools/tests/test_merge.py

+94-2
Original file line numberDiff line numberDiff line change
@@ -8,15 +8,15 @@
88
import numpy as np
99
import random
1010

11-
from pandas.compat import range, lrange, lzip, zip
11+
from pandas.compat import range, lrange, lzip, zip, StringIO
1212
from pandas import compat, _np_version_under1p7
1313
from pandas.tseries.index import DatetimeIndex
1414
from pandas.tools.merge import merge, concat, ordered_merge, MergeError
1515
from pandas.util.testing import (assert_frame_equal, assert_series_equal,
1616
assert_almost_equal, rands,
1717
makeCustomDataframe as mkdf,
1818
assertRaisesRegexp)
19-
from pandas import isnull, DataFrame, Index, MultiIndex, Panel, Series, date_range
19+
from pandas import isnull, DataFrame, Index, MultiIndex, Panel, Series, date_range, read_table
2020
import pandas.algos as algos
2121
import pandas.util.testing as tm
2222

@@ -1025,6 +1025,98 @@ def test_int64_overflow_issues(self):
10251025
result = merge(df1, df2, how='outer')
10261026
self.assertTrue(len(result) == 2000)
10271027

1028+
def test_join_multi_levels(self):
1029+
1030+
# GH 3662
1031+
# merge multi-levels
1032+
1033+
household = DataFrame(dict(household_id = [1,2,3],
1034+
male = [0,1,0],
1035+
wealth = [196087.3,316478.7,294750]),
1036+
columns = ['household_id','male','wealth']).set_index('household_id')
1037+
portfolio = DataFrame(dict(household_id = [1,2,2,3,3,3,4],
1038+
asset_id = ["nl0000301109","nl0000289783","gb00b03mlx29","gb00b03mlx29","lu0197800237","nl0000289965",np.nan],
1039+
name = ["ABN Amro","Robeco","Royal Dutch Shell","Royal Dutch Shell","AAB Eastern Europe Equity Fund","Postbank BioTech Fonds",np.nan],
1040+
share = [1.0,0.4,0.6,0.15,0.6,0.25,1.0]),
1041+
columns = ['household_id','asset_id','name','share']).set_index(['household_id','asset_id'])
1042+
result = household.join(portfolio, how='inner')
1043+
expected = DataFrame(dict(male = [0,1,1,0,0,0],
1044+
wealth = [ 196087.3, 316478.7, 316478.7, 294750.0, 294750.0, 294750.0 ],
1045+
name = ['ABN Amro','Robeco','Royal Dutch Shell','Royal Dutch Shell','AAB Eastern Europe Equity Fund','Postbank BioTech Fonds'],
1046+
share = [1.00,0.40,0.60,0.15,0.60,0.25],
1047+
household_id = [1,2,2,3,3,3],
1048+
asset_id = ['nl0000301109','nl0000289783','gb00b03mlx29','gb00b03mlx29','lu0197800237','nl0000289965']),
1049+
).set_index(['household_id','asset_id']).reindex(columns=['male','wealth','name','share'])
1050+
assert_frame_equal(result,expected)
1051+
1052+
assert_frame_equal(result,expected)
1053+
1054+
# equivalency
1055+
result2 = merge(household.reset_index(),portfolio.reset_index(),on=['household_id'],how='inner').set_index(['household_id','asset_id'])
1056+
assert_frame_equal(result2,expected)
1057+
1058+
result = household.join(portfolio, how='outer')
1059+
expected = concat([expected,DataFrame(dict(share = [1.00]),
1060+
index=MultiIndex.from_tuples([(4,np.nan)],
1061+
names=['household_id','asset_id']))],
1062+
axis=0).reindex(columns=expected.columns)
1063+
assert_frame_equal(result,expected)
1064+
1065+
# invalid cases
1066+
household.index.name = 'foo'
1067+
def f():
1068+
household.join(portfolio, how='inner')
1069+
self.assertRaises(ValueError, f)
1070+
1071+
portfolio2 = portfolio.copy()
1072+
portfolio2.index.set_names(['household_id','foo'])
1073+
def f():
1074+
portfolio2.join(portfolio, how='inner')
1075+
self.assertRaises(ValueError, f)
1076+
1077+
def test_join_multi_levels2(self):
1078+
1079+
# some more advanced merges
1080+
# GH6360
1081+
household = DataFrame(dict(household_id = [1,2,2,3,3,3,4],
1082+
asset_id = ["nl0000301109","nl0000301109","gb00b03mlx29","gb00b03mlx29","lu0197800237","nl0000289965",np.nan],
1083+
share = [1.0,0.4,0.6,0.15,0.6,0.25,1.0]),
1084+
columns = ['household_id','asset_id','share']).set_index(['household_id','asset_id'])
1085+
1086+
log_return = DataFrame(dict(
1087+
asset_id = ["gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "lu0197800237", "lu0197800237"],
1088+
t = [233, 234, 235, 180, 181],
1089+
log_return = [.09604978, -.06524096, .03532373, .03025441, .036997]
1090+
)).set_index(["asset_id","t"])
1091+
1092+
expected = DataFrame(dict(
1093+
household_id = [2, 2, 2, 3, 3, 3, 3, 3],
1094+
asset_id = ["gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "lu0197800237", "lu0197800237"],
1095+
t = [233, 234, 235, 233, 234, 235, 180, 181],
1096+
share = [0.6, 0.6, 0.6, 0.15, 0.15, 0.15, 0.6, 0.6],
1097+
log_return = [.09604978, -.06524096, .03532373, .09604978, -.06524096, .03532373, .03025441, .036997]
1098+
)).set_index(["household_id", "asset_id", "t"]).reindex(columns=['share','log_return'])
1099+
1100+
def f():
1101+
household.join(log_return, how='inner')
1102+
self.assertRaises(NotImplementedError, f)
1103+
1104+
# this is the equivalency
1105+
result = merge(household.reset_index(),log_return.reset_index(),on=['asset_id'],how='inner').set_index(['household_id','asset_id','t'])
1106+
assert_frame_equal(result,expected)
1107+
1108+
expected = DataFrame(dict(
1109+
household_id = [1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4],
1110+
asset_id = ["nl0000301109", "nl0000289783", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "gb00b03mlx29", "lu0197800237", "lu0197800237", "nl0000289965", None],
1111+
t = [None, None, 233, 234, 235, 233, 234, 235, 180, 181, None, None],
1112+
share = [1.0, 0.4, 0.6, 0.6, 0.6, 0.15, 0.15, 0.15, 0.6, 0.6, 0.25, 1.0],
1113+
log_return = [None, None, .09604978, -.06524096, .03532373, .09604978, -.06524096, .03532373, .03025441, .036997, None, None]
1114+
)).set_index(["household_id", "asset_id", "t"])
1115+
1116+
def f():
1117+
household.join(log_return, how='outer')
1118+
self.assertRaises(NotImplementedError, f)
1119+
10281120
def _check_join(left, right, result, join_col, how='left',
10291121
lsuffix='_x', rsuffix='_y'):
10301122
