Commit 4b9606b

Merge pull request #10054 from nickeubank/merge_indicator
ENH: Create merge indicator for obs from left, right, or both
2 parents f82e177 + addef51 commit 4b9606b

5 files changed, +205 -6 lines changed

doc/source/merging.rst

+42 -3
@@ -506,9 +506,9 @@ standard database join operations between DataFrame objects:
 
 ::
 
-    pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
-             left_index=False, right_index=False, sort=True,
-             suffixes=('_x', '_y'), copy=True)
+    merge(left, right, how='inner', on=None, left_on=None, right_on=None,
+          left_index=False, right_index=False, sort=True,
+          suffixes=('_x', '_y'), copy=True, indicator=False)
 
 Here's a description of what each argument is for:

@@ -539,6 +539,15 @@ Here's a description of what each argument is for:
   cases but may improve performance / memory usage. The cases where copying
   can be avoided are somewhat pathological but this option is provided
   nonetheless.
+- ``indicator``: Add a column to the output DataFrame called ``_merge``
+  with information on the source of each row. ``_merge`` is Categorical-type
+  and takes on a value of ``left_only`` for observations whose merge key
+  only appears in ``'left'`` DataFrame, ``right_only`` for observations whose
+  merge key only appears in ``'right'`` DataFrame, and ``both`` if the
+  observation's merge key is found in both.
+
+  .. versionadded:: 0.17.0
+
 
 The return type will be the same as ``left``. If ``left`` is a ``DataFrame``
 and ``right`` is a subclass of DataFrame, the return type will still be
@@ -684,6 +693,36 @@ either the left or right tables, the values in the joined table will be
               labels=['left', 'right'], vertical=False);
    plt.close('all');
 
+.. _merging.indicator:
+
+The merge indicator
+~~~~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.17.0
+
+``merge`` now accepts the argument ``indicator``. If ``True``, a Categorical-type column called ``_merge`` will be added to the output object that takes on values:
+
+  ===================================  ================
+  Observation Origin                   ``_merge`` value
+  ===================================  ================
+  Merge key only in ``'left'`` frame   ``left_only``
+  Merge key only in ``'right'`` frame  ``right_only``
+  Merge key in both frames             ``both``
+  ===================================  ================
+
+.. ipython:: python
+
+   df1 = DataFrame({'col1':[0,1], 'col_left':['a','b']})
+   df2 = DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
+   merge(df1, df2, on='col1', how='outer', indicator=True)
+
+The ``indicator`` argument will also accept string arguments, in which case the indicator function will use the value of the passed string as the name for the indicator column.
+
+.. ipython:: python
+
+   merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
+
+
 .. _merging.join.index:
 
 Joining on index
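
The documentation added above can be tried with a short script. This is a minimal sketch that assumes a pandas version with the `indicator` argument (0.17.0 or later) and calls the public `pd.merge` entry point instead of the bare `merge`/`DataFrame` names used in the docs namespace. The frames are the ones from the docs example.

```python
import pandas as pd

# Same frames as the documentation example in this diff.
df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

# indicator=True adds a categorical column named "_merge".
merged = pd.merge(df1, df2, on='col1', how='outer', indicator=True)

# The column always carries the same three categories, even if some
# of them do not occur in a particular result.
categories = list(merged['_merge'].cat.categories)

# Passing a string names the indicator column instead of using "_merge".
named = pd.merge(df1, df2, on='col1', how='outer',
                 indicator='indicator_column')

print(merged)
print(categories)
print(named.columns.tolist())
```

Key 0 exists only in `df1`, key 1 exists in both frames, and key 2 appears twice in `df2` only, so the outer merge produces one `left_only` row, one `both` row, and two `right_only` rows.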

doc/source/whatsnew/v0.17.0.txt

+21
@@ -51,6 +51,27 @@ Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsne
 New features
 ~~~~~~~~~~~~
 
+- ``merge`` now accepts the argument ``indicator`` which adds a Categorical-type column (by default called ``_merge``) to the output object that takes on the values:
+
+  ===================================  ================
+  Observation Origin                   ``_merge`` value
+  ===================================  ================
+  Merge key only in ``'left'`` frame   ``left_only``
+  Merge key only in ``'right'`` frame  ``right_only``
+  Merge key in both frames             ``both``
+  ===================================  ================
+
+  For more, see the :ref:`updated docs <merging.indicator>`
+
+  .. ipython:: python
+
+    df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
+    df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
+    pd.merge(df1, df2, on='col1', how='outer', indicator=True)
+
+
+
+
 - ``DataFrame`` has the ``nlargest`` and ``nsmallest`` methods (:issue:`10393`)
 - SQL io functions now accept a SQLAlchemy connectable. (:issue:`7877`)
 - Enable writing complex values to HDF stores when using table format (:issue:`10447`)
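
One way the new column might be used, sketched here as an assumption about typical usage and not something the release note prescribes, is to select rows by origin after an outer merge. The variable names `unmatched_left` and `unmatched_right` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

merged = pd.merge(df1, df2, on='col1', how='outer', indicator=True)

# Rows whose merge key was found only in df1 (an "anti-join" against df2).
unmatched_left = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Rows whose merge key was found only in df2.
unmatched_right = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')

print(unmatched_left)
print(unmatched_right)
```

Comparing the categorical column against a plain string works because the string is one of the column's categories.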

pandas/core/frame.py

+11
@@ -115,6 +115,17 @@
     side, respectively
 copy : boolean, default True
     If False, do not copy data unnecessarily
+indicator : boolean or string, default False
+    If True, adds a column to output DataFrame called "_merge" with
+    information on the source of each row.
+    If string, column with information on source of each row will be added to
+    output DataFrame, and column will be named value of string.
+    Information column is Categorical-type and takes on a value of "left_only"
+    for observations whose merge key only appears in 'left' DataFrame,
+    "right_only" for observations whose merge key only appears in 'right'
+    DataFrame, and "both" if the observation's merge key is found in both.
+
+    .. versionadded:: 0.17.0
 
 Examples
 --------
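
The docstring above restricts `indicator` to a boolean or a string. The sketch below probes that contract from the caller's side. `merge_error` is a hypothetical helper written for this illustration, not a pandas function, and the exact error messages are deliberately not asserted because they may vary between versions.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})


def merge_error(indicator):
    """Return the ValueError message raised for this `indicator`, or None."""
    try:
        pd.merge(df1, df2, on='col1', how='outer', indicator=indicator)
    except ValueError as exc:
        return str(exc)
    return None


# Neither a bool nor a string: rejected.
bad_type = merge_error(5)

# A string that collides with an existing column name: rejected.
name_clash = merge_error('col_right')

# A fresh string name: accepted.
accepted = merge_error('source')

print(bad_type)
print(name_clash)
print(accepted)
```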

pandas/tools/merge.py

+52 -3
@@ -27,11 +27,11 @@
 @Appender(_merge_doc, indents=0)
 def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
           left_index=False, right_index=False, sort=False,
-          suffixes=('_x', '_y'), copy=True):
+          suffixes=('_x', '_y'), copy=True, indicator=False):
     op = _MergeOperation(left, right, how=how, on=on, left_on=left_on,
                          right_on=right_on, left_index=left_index,
                          right_index=right_index, sort=sort, suffixes=suffixes,
-                         copy=copy)
+                         copy=copy, indicator=indicator)
     return op.get_result()
 if __debug__:
     merge.__doc__ = _merge_doc % '\nleft : DataFrame'
@@ -157,7 +157,7 @@ class _MergeOperation(object):
     def __init__(self, left, right, how='inner', on=None,
                  left_on=None, right_on=None, axis=1,
                  left_index=False, right_index=False, sort=True,
-                 suffixes=('_x', '_y'), copy=True):
+                 suffixes=('_x', '_y'), copy=True, indicator=False):
         self.left = self.orig_left = left
         self.right = self.orig_right = right
         self.how = how
@@ -174,12 +174,25 @@ def __init__(self, left, right, how='inner', on=None,
         self.left_index = left_index
         self.right_index = right_index
 
+        self.indicator = indicator
+
+        if isinstance(self.indicator, compat.string_types):
+            self.indicator_name = self.indicator
+        elif isinstance(self.indicator, bool):
+            self.indicator_name = '_merge' if self.indicator else None
+        else:
+            raise ValueError('indicator option can only accept boolean or string arguments')
+
+
         # note this function has side effects
         (self.left_join_keys,
          self.right_join_keys,
          self.join_names) = self._get_merge_keys()
 
     def get_result(self):
+        if self.indicator:
+            self.left, self.right = self._indicator_pre_merge(self.left, self.right)
+
         join_index, left_indexer, right_indexer = self._get_join_info()
 
         ldata, rdata = self.left._data, self.right._data
@@ -199,10 +212,46 @@ def get_result(self):
         typ = self.left._constructor
         result = typ(result_data).__finalize__(self, method='merge')
 
+        if self.indicator:
+            result = self._indicator_post_merge(result)
+
         self._maybe_add_join_keys(result, left_indexer, right_indexer)
 
         return result
 
+    def _indicator_pre_merge(self, left, right):
+
+        columns = left.columns.union(right.columns)
+
+        for i in ['_left_indicator', '_right_indicator']:
+            if i in columns:
+                raise ValueError("Cannot use `indicator=True` option when data contains a column named {}".format(i))
+        if self.indicator_name in columns:
+            raise ValueError("Cannot use name of an existing column for indicator column")
+
+        left = left.copy()
+        right = right.copy()
+
+        left['_left_indicator'] = 1
+        left['_left_indicator'] = left['_left_indicator'].astype('int8')
+
+        right['_right_indicator'] = 2
+        right['_right_indicator'] = right['_right_indicator'].astype('int8')
+
+        return left, right
+
+    def _indicator_post_merge(self, result):
+
+        result['_left_indicator'] = result['_left_indicator'].fillna(0)
+        result['_right_indicator'] = result['_right_indicator'].fillna(0)
+
+        result[self.indicator_name] = Categorical((result['_left_indicator'] + result['_right_indicator']), categories=[1,2,3])
+        result[self.indicator_name] = result[self.indicator_name].cat.rename_categories(['left_only', 'right_only', 'both'])
+
+        result = result.drop(labels=['_left_indicator', '_right_indicator'], axis=1)
+
+        return result
+
     def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
         # insert group keys

pandas/tools/tests/test_merge.py

+79
@@ -946,6 +946,85 @@ def test_overlapping_columns_error_message(self):
         df2.columns = ['key1', 'foo', 'foo']
         self.assertRaises(ValueError, merge, df, df2)
 
+    def test_indicator(self):
+        # PR #10054. xref #7412 and closes #8790.
+        df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b'], 'col_conflict':[1,2]})
+        df1_copy = df1.copy()
+
+        df2 = pd.DataFrame({'col1':[1,2,3,4,5],'col_right':[2,2,2,2,2],
+                            'col_conflict':[1,2,3,4,5]})
+        df2_copy = df2.copy()
+
+        df_result = pd.DataFrame({'col1':[0,1,2,3,4,5],
+                                  'col_conflict_x':[1,2,np.nan,np.nan,np.nan,np.nan],
+                                  'col_left':['a','b', np.nan,np.nan,np.nan,np.nan],
+                                  'col_conflict_y':[np.nan,1,2,3,4,5],
+                                  'col_right':[np.nan, 2,2,2,2,2]},
+                                 dtype='float64')
+        df_result['_merge'] = pd.Categorical(['left_only','both','right_only',
+                                              'right_only','right_only','right_only']
+                                             , categories=['left_only', 'right_only', 'both'])
+
+        df_result = df_result[['col1', 'col_conflict_x', 'col_left',
+                               'col_conflict_y', 'col_right', '_merge' ]]
+
+        test = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
+        assert_frame_equal(test, df_result)
+
+        # No side effects
+        assert_frame_equal(df1, df1_copy)
+        assert_frame_equal(df2, df2_copy)
+
+        # Check with custom name
+        df_result_custom_name = df_result
+        df_result_custom_name = df_result_custom_name.rename(columns={'_merge':'custom_name'})
+
+        test_custom_name = pd.merge(df1, df2, on='col1', how='outer', indicator='custom_name')
+        assert_frame_equal(test_custom_name, df_result_custom_name)
+
+        # Check only accepts strings and booleans
+        with tm.assertRaises(ValueError):
+            pd.merge(df1, df2, on='col1', how='outer', indicator=5)
+
+        # Check result integrity
+
+        test2 = pd.merge(df1, df2, on='col1', how='left', indicator=True)
+        self.assertTrue((test2._merge != 'right_only').all())
+
+        test3 = pd.merge(df1, df2, on='col1', how='right', indicator=True)
+        self.assertTrue((test3._merge != 'left_only').all())
+
+        test4 = pd.merge(df1, df2, on='col1', how='inner', indicator=True)
+        self.assertTrue((test4._merge == 'both').all())
+
+        # Check if working name in df
+        for i in ['_right_indicator', '_left_indicator', '_merge']:
+            df_badcolumn = pd.DataFrame({'col1':[1,2], i:[2,2]})
+
+            with tm.assertRaises(ValueError):
+                pd.merge(df1, df_badcolumn, on='col1', how='outer', indicator=True)
+
+        # Check for name conflict with custom name
+        df_badcolumn = pd.DataFrame({'col1':[1,2], 'custom_column_name':[2,2]})
+
+        with tm.assertRaises(ValueError):
+            pd.merge(df1, df_badcolumn, on='col1', how='outer', indicator='custom_column_name')
+
+        # Merge on multiple columns
+        df3 = pd.DataFrame({'col1':[0,1], 'col2':['a','b']})
+
+        df4 = pd.DataFrame({'col1':[1,1,3], 'col2':['b','x','y']})
+
+        hand_coded_result = pd.DataFrame({'col1':[0,1,1,3.0],
+                                          'col2':['a','b','x','y']})
+        hand_coded_result['_merge'] = pd.Categorical(
+            ['left_only','both','right_only','right_only']
+            , categories=['left_only', 'right_only', 'both'])
+
+        test5 = pd.merge(df3, df4, on=['col1', 'col2'], how='outer', indicator=True)
+        assert_frame_equal(test5, hand_coded_result)
+
+
 def _check_merge(x, y):
     for how in ['inner', 'left', 'outer']:
         result = x.join(y, how=how)
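
The "result integrity" checks in `test_indicator` above rely on each join type admitting only certain indicator values. A compact way to see this, sketched with plain assertions-free code and the same key columns as the test (the `observed` dictionary is an illustrative name, not part of the test suite):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col_right': [2, 2, 2, 2, 2]})

# Collect the set of indicator values that each join type produces.
observed = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(df1, df2, on='col1', how=how, indicator=True)
    observed[how] = set(result['_merge'].astype(str))

for how, values in observed.items():
    print(how, sorted(values))
```

An inner join can only yield `both`, a left join never yields `right_only`, a right join never yields `left_only`, and only an outer join can yield all three values.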
