ENH: Create merge indicator for obs from left, right, or both #10054
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -489,9 +489,9 @@ standard database join operations between DataFrame objects: | |
|
||
:: | ||
|
||
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, | ||
left_index=False, right_index=False, sort=True, | ||
suffixes=('_x', '_y'), copy=True) | ||
merge(left, right, how='inner', on=None, left_on=None, right_on=None, | ||
left_index=False, right_index=False, sort=True, | ||
suffixes=('_x', '_y'), copy=True, indicator=False) | ||
|
||
Here's a description of what each argument is for: | ||
|
||
|
@@ -522,6 +522,15 @@ Here's a description of what each argument is for: | |
cases but may improve performance / memory usage. The cases where copying | ||
can be avoided are somewhat pathological but this option is provided | ||
nonetheless. | ||
- ``indicator``: Add a column to the output DataFrame called ``_merge`` | ||
with information on the source of each row. ``_merge`` is Categorical-type | ||
and takes on a value of ``left_only`` for observations whose merge key | ||
only appears in ``'left'`` DataFrame, ``right_only`` for observations whose | ||
Review comment: put the same example that you have in the whatsnew here (I would also make a sub-section I think)
Reply: see below
||
merge key only appears in ``'right'`` DataFrame, and ``both`` if the | ||
observation's merge key is found in both. | ||
|
||
.. versionadded:: 0.17.0 | ||
|
||
|
||
The return type will be the same as ``left``. If ``left`` is a ``DataFrame`` | ||
and ``right`` is a subclass of DataFrame, the return type will still be | ||
|
@@ -667,6 +676,36 @@ either the left or right tables, the values in the joined table will be | |
labels=['left', 'right'], vertical=False); | ||
plt.close('all'); | ||
|
||
.. _merging.indicator: | ||
|
||
The merge indicator | ||
~~~~~~~~~~~~~~~~~~~ | ||
|
||
Review comment: add versionadded directive
Reply: check
||
.. versionadded:: 0.17.0 | ||
|
||
``merge`` now accepts the argument ``indicator``. If ``True``, a Categorical-type column called ``_merge`` will be added to the output object that takes on values: | ||
|
||
=================================== ================ | ||
Observation Origin ``_merge`` value | ||
=================================== ================ | ||
Merge key only in ``'left'`` frame ``left_only`` | ||
Merge key only in ``'right'`` frame ``right_only`` | ||
Merge key in both frames ``both`` | ||
=================================== ================ | ||
|
||
.. ipython:: python | ||
|
||
df1 = DataFrame({'col1':[0,1], 'col_left':['a','b']}) | ||
df2 = DataFrame({'col1':[1,2,2],'col_right':[2,2,2]}) | ||
merge(df1, df2, on='col1', how='outer', indicator=True) | ||
|
||
The ``indicator`` argument will also accept string arguments, in which case the indicator function will use the value of the passed string as the name for the indicator column. | ||
|
||
.. ipython:: python | ||
|
||
merge(df1, df2, on='col1', how='outer', indicator='indicator_column') | ||
|
||
|
||
.. _merging.join.index: | ||
|
||
Joining on index | ||
|
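Note: a minimal sketch of how the documented behaviour could be used, assuming a pandas release that includes this `indicator` option and the common `import pandas as pd` alias (the diff itself calls `merge` and `DataFrame` unqualified). It reuses the two frames from the documentation example above. The variable names `merged`, `counts`, and `left_only` are illustrative only.

```python
import pandas as pd

# Same frames as in the documentation example in the diff.
df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

# indicator=True adds a Categorical column named "_merge".
merged = pd.merge(df1, df2, on="col1", how="outer", indicator=True)

# Count how many output rows came from each source.
counts = merged["_merge"].value_counts().to_dict()

# Keep only the rows whose merge key was missing from the right frame.
left_only = merged[merged["_merge"] == "left_only"]
```

Filtering on `_merge` like this is one way to audit an outer join, for example to list the keys that failed to match.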
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -115,6 +115,17 @@ | |
side, respectively | ||
copy : boolean, default True | ||
If False, do not copy data unnecessarily | ||
indicator : boolean or string, default False | ||
If True, adds a column to output DataFrame called "_merge" with | ||
Review comment: add a versionadded directive
Reply: Check (I think that's formatted correctly?)
||
information on the source of each row. | ||
If string, column with information on source of each row will be added to | ||
output DataFrame, and column will be named value of string. | ||
Information column is Categorical-type and takes on a value of "left_only" | ||
for observations whose merge key only appears in 'left' DataFrame, | ||
"right_only" for observations whose merge key only appears in 'right' | ||
DataFrame, and "both" if the observation's merge key is found in both. | ||
|
||
.. versionadded:: 0.17.0 | ||
|
||
Examples | ||
-------- | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -27,11 +27,11 @@ | |
@Appender(_merge_doc, indents=0) | ||
def merge(left, right, how='inner', on=None, left_on=None, right_on=None, | ||
left_index=False, right_index=False, sort=False, | ||
suffixes=('_x', '_y'), copy=True): | ||
suffixes=('_x', '_y'), copy=True, indicator=False): | ||
op = _MergeOperation(left, right, how=how, on=on, left_on=left_on, | ||
right_on=right_on, left_index=left_index, | ||
right_index=right_index, sort=sort, suffixes=suffixes, | ||
copy=copy) | ||
copy=copy, indicator=indicator) | ||
return op.get_result() | ||
if __debug__: | ||
merge.__doc__ = _merge_doc % '\nleft : DataFrame' | ||
|
@@ -157,7 +157,7 @@ class _MergeOperation(object): | |
def __init__(self, left, right, how='inner', on=None, | ||
left_on=None, right_on=None, axis=1, | ||
left_index=False, right_index=False, sort=True, | ||
suffixes=('_x', '_y'), copy=True): | ||
suffixes=('_x', '_y'), copy=True, indicator=False): | ||
self.left = self.orig_left = left | ||
self.right = self.orig_right = right | ||
self.how = how | ||
|
@@ -174,12 +174,25 @@ def __init__(self, left, right, how='inner', on=None, | |
self.left_index = left_index | ||
self.right_index = right_index | ||
|
||
self.indicator = indicator | ||
|
||
if isinstance(self.indicator, compat.string_types): | ||
self.indicator_name = self.indicator | ||
elif isinstance(self.indicator, bool): | ||
self.indicator_name = '_merge' if self.indicator else None | ||
else: | ||
raise ValueError('indicator option can only accept boolean or string arguments') | ||
|
||
|
||
# note this function has side effects | ||
(self.left_join_keys, | ||
self.right_join_keys, | ||
self.join_names) = self._get_merge_keys() | ||
|
||
def get_result(self): | ||
if self.indicator: | ||
self.left, self.right = self._indicator_pre_merge(self.left, self.right) | ||
|
||
join_index, left_indexer, right_indexer = self._get_join_info() | ||
|
||
ldata, rdata = self.left._data, self.right._data | ||
|
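Note: the argument handling added to `_MergeOperation.__init__` in this hunk can be sketched in isolation as below. This is an illustration, not the PR's code: `resolve_indicator_name` is a hypothetical helper, and it assumes Python 3's `str` in place of `compat.string_types`.

```python
def resolve_indicator_name(indicator):
    """Hypothetical helper mirroring the branch in _MergeOperation.__init__."""
    if isinstance(indicator, str):
        # A string is used directly as the indicator column name.
        return indicator
    if isinstance(indicator, bool):
        # True selects the default name; False disables the indicator.
        return "_merge" if indicator else None
    raise ValueError(
        "indicator option can only accept boolean or string arguments"
    )


names = [resolve_indicator_name(v) for v in (True, False, "indicator_column")]
```

Any other type, such as an integer, raises `ValueError`, matching the `else` branch in the diff.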
@@ -199,10 +212,46 @@ def get_result(self): | |
typ = self.left._constructor | ||
result = typ(result_data).__finalize__(self, method='merge') | ||
|
||
if self.indicator: | ||
result = self._indicator_post_merge(result) | ||
|
||
self._maybe_add_join_keys(result, left_indexer, right_indexer) | ||
|
||
return result | ||
|
||
def _indicator_pre_merge(self, left, right): | ||
|
||
columns = left.columns.union(right.columns) | ||
|
||
for i in ['_left_indicator', '_right_indicator']: | ||
if i in columns: | ||
raise ValueError("Cannot use `indicator=True` option when data contains a column named {}".format(i)) | ||
if self.indicator_name in columns: | ||
raise ValueError("Cannot use name of an existing column for indicator column") | ||
|
||
left = left.copy() | ||
right = right.copy() | ||
|
||
left['_left_indicator'] = 1 | ||
left['_left_indicator'] = left['_left_indicator'].astype('int8') | ||
|
||
right['_right_indicator'] = 2 | ||
right['_right_indicator'] = right['_right_indicator'].astype('int8') | ||
|
||
return left, right | ||
|
||
def _indicator_post_merge(self, result): | ||
|
||
result['_left_indicator'] = result['_left_indicator'].fillna(0) | ||
result['_right_indicator'] = result['_right_indicator'].fillna(0) | ||
|
||
result[self.indicator_name] = Categorical((result['_left_indicator'] + result['_right_indicator']), categories=[1,2,3]) | ||
Review comment: your categories are
Reply: The categories map onto the values of 1,2,3 (with 3 being both).
Review comment: what is the fillna for then?
Reply: The flow is:
Review comment: Ahh I c ok
||
result[self.indicator_name] = result[self.indicator_name].cat.rename_categories(['left_only', 'right_only', 'both']) | ||
|
||
result = result.drop(labels=['_left_indicator', '_right_indicator'], axis=1) | ||
|
||
return result | ||
|
||
def _maybe_add_join_keys(self, result, left_indexer, right_indexer): | ||
# insert group keys | ||
|
||
|
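Note: reading `_indicator_pre_merge` and `_indicator_post_merge` together, the left frame is tagged with 1, the right frame with 2, missing tags are filled with 0 after the merge, and the sum (1, 2, or 3) is renamed to `left_only`, `right_only`, or `both`. The pure-Python sketch below illustrates that arithmetic for an outer join. It works per distinct key rather than per output row, and `merge_indicator` and `LABELS` are hypothetical names that do not appear in the PR.

```python
# Sum of the two flags -> category label, as in rename_categories in the diff.
LABELS = {1: "left_only", 2: "right_only", 3: "both"}


def merge_indicator(left_keys, right_keys):
    """Label each distinct merge key by where it appears (outer join)."""
    left_set, right_set = set(left_keys), set(right_keys)
    labels = {}
    for key in sorted(left_set | right_set):
        # A key absent from one side plays the role of NaN filled with 0.
        left_flag = 1 if key in left_set else 0
        right_flag = 2 if key in right_set else 0
        labels[key] = LABELS[left_flag + right_flag]
    return labels


result = merge_indicator([0, 1], [1, 2, 2])
```

With the keys from the documentation example, key 0 is labelled `left_only`, key 1 `both`, and key 2 `right_only`.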
Review comment: add a version added here as well