ENH: Create merge indicator for obs from left, right, or both #10054

nickeubank · 2015-05-03T17:12:39Z

Adds a new column (_merge) to DataFrames being merged that takes on a value of 1 if observation was only in left df, 2 if only in right df, and 3 if in both. Designed to address #7412 and #8790.

Update: new column now categorical with more informative labels of left_only, right_only, and both.
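
For readers who want to see the described behaviour, here is a minimal sketch using the `indicator` argument as it was eventually released in pandas. The frames `left` and `right` and their column names are made up for illustration and are not taken from this PR's tests.

```python
import pandas as pd

# Illustrative inputs: key 1 is only in left, key 3 only in right, key 2 in both.
left = pd.DataFrame({"key": [1, 2], "lval": ["a", "b"]})
right = pd.DataFrame({"key": [2, 3], "rval": ["x", "y"]})

# indicator=True adds a categorical column named "_merge" to the result.
merged = pd.merge(left, right, on="key", how="outer", indicator=True)

# Map each key to the label describing where its row came from.
labels = merged.set_index("key")["_merge"].astype(str).to_dict()
print(labels)
print(merged["_merge"].dtype)
```

An outer merge is used because it is the only join type that can produce all three labels at once.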

Still at draft stage, but want to get comments before trying to refine too much.

jreback · 2015-05-04T10:36:54Z

pandas/tools/merge.py

+
+
+    def _indicator_post_merge(self, result, left, right):
+        result['_left_indicator'].fillna(0, inplace=True)        


don't use inplace. It makes the code much harder to read and is not used anywhere within the pandas codebase (well it shouldn't be)

Oh! Ok will update.

@jreback: for my own purposes: I thought inplace was preferable for large datasets because it prevents unnecessary memory duplication. Am I wrong that:

s = s.fillna(0)

causes duplication of s in memory (at least temporarily), while:

s.fillna(0, inplace=True)

does not?

nickeubank · 2015-05-04T18:39:52Z

@jreback: I've pushed a version that does not use inplace, but it comes at a cost. Namely, I don't think I can avoid making full duplicates of both input DataFrames if I don't use inplace=True for the drop() commands.

The difficulty is the following: at the moment I'm adding indicator columns to the left and right DataFrames (which I can do without inplace), passing the updated DataFrames into the concatenate_block_managers, then using those columns post-merge to identify the source of each row.

This can be done in one of two ways: (1) modify the input DataFrames by adding new columns to those objects and then deleting those columns after the merge (as I do in commit 76e4f33), or (2) duplicate those objects and then add the columns to the copies of the original DataFrames (as in this commit, cf8412f).

(1) seems preferable to me since it does not entail the duplication of both input DataFrames in memory. But I think (1) requires inplace modification to remove those indicator columns.

With inplace=True, I can cleanup in _indicator_post_merge() as follows:

left.drop(labels='_left_indicator', axis=1, inplace=True)

But if I do:

left = left.drop(labels='_left_indicator', axis=1)

then that's not actually changing the input DataFrame, just creating a new DataFrame and pointing the variable left to it. But once the function finishes, since merge doesn't return new versions of the input DataFrames, those changes disappear, and the input DataFrames still have the extra columns.

So I think the question is: can we make an exception to the "no use of inplace rule" to allow this option to not fully duplicate both input DataFrames?
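
The rebinding behaviour described above can be reproduced in isolation. This is a small sketch; the helper names `cleanup_rebind` and `cleanup_inplace` are hypothetical and exist only for this illustration.

```python
import pandas as pd


def cleanup_rebind(df):
    # Rebinds the local name only: drop() returns a new DataFrame and the
    # caller's object keeps its extra column.
    df = df.drop(labels="_left_indicator", axis=1)


def cleanup_inplace(df):
    # Mutates the object the caller passed in.
    df.drop(labels="_left_indicator", axis=1, inplace=True)


left = pd.DataFrame({"key": [1, 2], "_left_indicator": [1, 1]})

cleanup_rebind(left)
after_rebind = list(left.columns)   # the indicator column is still there

cleanup_inplace(left)
after_inplace = list(left.columns)  # the indicator column is gone

print(after_rebind, after_inplace)
```

This only shows which object is modified. It says nothing about memory use, which is the separate question raised in this comment.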

jreback · 2015-05-04T18:57:16Z

@nickeubank inplace ops are rarely faster. They most often copy, modify the copy, then make it so the original variable points to the copy.

The semantics of almost all pandas methods (e.g. everything except in-place indexing / inplace kw) are NOT to touch the passed-in data in any way.

If you want an indicator for the merged data, I don't see a problem with simply copying the input data and then doing what you want.

The marginal perf benefit is really small. We prefer correctness over perf any day.

nickeubank · 2015-05-04T18:59:48Z

OK, thanks for clarifying! Good to know both that that's how it works, and your preference ordering. :)

nickeubank · 2015-05-08T04:45:48Z

@jreback One other question on priorities: I thought of a way to do this that would be slower than the version I've pushed, but would require less memory. Any rules of thumb for balancing those interests?

Currently: Copy input data; add indicator columns; feed through merge; create an indicator of where each row came from based on those indicator columns.
-> One merge, but have to make copies of both input DataFrames.

Alternative: Merge input data as normal; pull out merging columns from input data and merge them into the output, use those merges to create indicators for where each row came from.
-> Three merges (initial, and one for each input), but no full copying of input DataFrames.
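
A rough sketch of the "Currently" approach (copy, tag, merge, derive the label) is given below. The function `merge_with_indicator` is a hypothetical stand-in written for this illustration. It is not the code in this PR or in pandas, and it assumes the inputs have no columns named `_left_indicator`, `_right_indicator`, or `_merge`.

```python
import pandas as pd


def merge_with_indicator(left, right, **kwargs):
    """Hypothetical helper: tag copies of the inputs, merge once, label rows."""
    # Copy so the caller's DataFrames are never modified.
    left = left.copy()
    right = right.copy()
    left["_left_indicator"] = 1
    right["_right_indicator"] = 2

    result = pd.merge(left, right, **kwargs)

    # Rows missing from one side have NaN in that side's tag; treat it as 0.
    code = (
        result["_left_indicator"].fillna(0) + result["_right_indicator"].fillna(0)
    ).astype("int8")
    result["_merge"] = pd.Categorical(
        code.map({1: "left_only", 2: "right_only", 3: "both"}),
        categories=["left_only", "right_only", "both"],
    )
    return result.drop(labels=["_left_indicator", "_right_indicator"], axis=1)


left = pd.DataFrame({"key": [1, 2], "lval": ["a", "b"]})
right = pd.DataFrame({"key": [2, 3], "rval": ["x", "y"]})
out = merge_with_indicator(left, right, on="key", how="outer")
print(out)
```

The cost is one full copy of each input. In exchange, only a single merge is needed and the caller's frames are left as they were.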

jreback · 2015-05-08T10:57:54Z

just copy the data; much simpler

nickeubank · 2015-05-10T03:50:53Z

Added docs. Anything else?

jreback · 2015-05-10T14:20:35Z

you can move this to the 0.17.0 docs

jreback · 2015-07-28T21:55:42Z

@nickeubank can you rebase this and i'll take another look

nickeubank · 2015-07-28T23:35:22Z

@jreback rebased!

jreback · 2015-08-15T23:28:07Z

pandas/tools/tests/test_merge.py

@@ -864,6 +864,54 @@ def test_overlapping_columns_error_message(self):
        df2.columns = ['key1', 'foo', 'foo']
        self.assertRaises(ValueError, merge, df, df2)

+    def test_indicator(self):
+


add the issue number here

nickeubank · 2015-08-16T20:51:08Z

@jreback added suggested tweak to allow users to pass strings to indicator as name for indicator column.
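
As a quick illustration of the string form, the sketch below passes a column name instead of `True`. The name `"source"` and the example frames are arbitrary choices made for this example.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "lval": ["a", "b"]})
right = pd.DataFrame({"key": [2, 3], "rval": ["x", "y"]})

# A string value for indicator names the output column instead of "_merge".
merged = pd.merge(left, right, on="key", how="outer", indicator="source")

columns = list(merged.columns)
labels = merged.set_index("key")["source"].astype(str).to_dict()
print(columns)
print(labels)
```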

jreback · 2015-08-19T21:07:16Z

pandas/core/frame.py

@@ -115,6 +115,15 @@
    side, respectively
 copy : boolean, default True
    If False, do not copy data unnecessarily
+indicator : boolean or string, default False
+    If True, adds a column to output DataFrame called "_merge" with 


add a versionadded directive

Check (I think that's formatted correctly?)

nickeubank · 2015-08-20T18:20:49Z

@jreback sorry about scrubbing confusion. Think better now.

jreback · 2015-08-20T18:24:06Z

doc/source/merging.rst

@@ -523,6 +523,12 @@ Here's a description of what each argument is for:
    cases but may improve performance / memory usage. The cases where copying
    can be avoided are somewhat pathological but this option is provided
    nonetheless.
+  - ``indicator``: Add a column to the output DataFrame called ``_merge``
+    with information on the source of each row. ``_merge`` is Categorical-type 


add a version added here as well

jreback · 2015-08-20T18:24:54Z

ok, code looks good. I like what you added to merging.rst. Can you put the meaning of the returned values in the merge indicator in bullet points or a table, for better readability?

nickeubank · 2015-08-20T18:42:16Z

@jreback ok, updated.

One note: not quite sure how best to format the versionadded advisory in the list of options for merge -- made an attempt but don't love it. Let me know if there's a better way.

jreback · 2015-08-20T18:52:52Z

look at how the changes in convert_objects are done in whatsnew/v0.17.0 (a table is created; it's pretty easy)

nickeubank · 2015-08-20T19:09:23Z

ok -- now a table!

jreback · 2015-08-20T19:11:15Z

doc/source/merging.rst

+  Observation Origin                    ``_merge`` value
+  ===================================   ================
+  Merge key only in ``'left'`` frame    ``left_only``
+  Merge key only in ``'right'`` frame   ``left_only``


jreback · 2015-08-20T19:12:42Z

can you add some tests with multiple merge keys (e.g. a list for on)? It seems that should still work.
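
A check along those lines might look like the following sketch. The column names `k1` and `k2` and the data are invented for illustration and are not the tests added to test_merge.py.

```python
import pandas as pd

left = pd.DataFrame({"k1": [0, 0, 1], "k2": ["a", "b", "a"], "lval": [1, 2, 3]})
right = pd.DataFrame({"k1": [0, 1, 1], "k2": ["a", "a", "b"], "rval": [10, 20, 30]})

# Passing a list to `on` merges on the combination of both key columns.
merged = pd.merge(left, right, on=["k1", "k2"], how="outer", indicator=True)

# Label for each (k1, k2) pair.
origin = merged.set_index(["k1", "k2"])["_merge"].astype(str).to_dict()
print(origin)
```

A row is labelled `both` only when the full key combination appears in each frame, so `(0, "b")` and `(1, "b")` end up on opposite sides.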

nickeubank · 2015-08-20T19:50:44Z

@jreback added.

Sidenote: I hadn't noticed this before, but merge type-coerces key columns from int to float if you do an outer merge, even though the output is all int. Is that a known behavior?:

In[30]:
df1 = pd.DataFrame({'col1':[0,1], 'colleft':['a','b']})
df2 = pd.DataFrame({'col1':[1,3], 'colright':['b','x']})
pd.merge(df1, df2, how='outer', on='col1').col1.dtypes
Out[30]: dtype('float64')

In[31]:
pd.merge(df1, df2, how='inner', on='col1').col1.dtypes
Out[31]: dtype('int64')

jreback · 2015-08-20T22:15:47Z

@nickeubank this is a byproduct of the fact that, a priori, you don't know if you will have to put NaN in columns, so you just coerce to a NaN-capable type and go from there. I suppose you could coerce back in this case.

If you'd like to make another issue, go ahead (though look through existing ones, as it might already be there). And of course PRs are welcome.
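
To make the NaN-capable coercion concrete, here is a small sketch with made-up frames. It inspects a non-key column, where the missing value is real, and then casts the key column back by hand. The dtype of the key column straight out of the merge is deliberately not asserted, since it may differ between pandas versions.

```python
import pandas as pd

left = pd.DataFrame({"key": [0, 1], "lval": [10, 20]})
right = pd.DataFrame({"key": [1, 3], "rval": [100, 300]})

merged = pd.merge(left, right, on="key", how="outer")

# key 3 has no row in `left`, so `lval` needs a NaN and cannot stay int64.
lval_dtype = str(merged["lval"].dtype)
missing_lval = int(merged["lval"].isna().sum())

# The key column has no missing values, so casting it back to int is safe.
key_as_int = merged["key"].astype("int64")

print(lval_dtype, missing_lval, str(key_as_int.dtype))
```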

jreback · 2015-08-20T22:16:36Z

doc/source/whatsnew/v0.17.0.txt

@@ -33,6 +33,18 @@ Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsne
 New features
 ~~~~~~~~~~~~

+- ``merge`` now accepts the argument ``indicator`` which adds a Categorical-type column to the output `DataFrame` that takes on a value of ``left_only`` for observations whose merge key only appears in ``'left'`` DataFrame, ``right_only`` for observations whose merge key only appears in ``'right'`` DataFrame, and ``both`` if the observation's merge key is found in both. For more see updated :ref:`docs <merging.indicator>`


you can put that mini-table here as well

nickeubank · 2015-08-21T01:42:59Z

@jreback table added!

thanks for note about coercion. Doesn't seem like highest priority, but I'll keep it in mind for a place to contribute later!

nickeubank · 2015-08-24T16:30:41Z

@jreback Anything else you'd like me to amend on this?

jreback · 2015-08-24T19:21:51Z

wanted @jorisvandenbossche to have a look

jreback · 2015-09-01T11:57:43Z

@nickeubank pls rebase

@TomAugspurger @jorisvandenbossche any comments?

nickeubank · 2015-09-01T15:47:06Z

@jreback rebased

jreback · 2015-09-03T13:42:31Z

@shoyer quick look pls.

shoyer · 2015-09-04T00:31:46Z

This looks pretty clean to me. Nice work @nickeubank!

ENH: Create merge indicator for obs from left, right, or both

jreback · 2015-09-04T00:51:57Z

thanks!

nickeubank force-pushed the merge_indicator branch 7 times, most recently from 8faf51d to 76e4f33 Compare May 3, 2015 23:23

jreback reviewed May 4, 2015

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels May 4, 2015

jreback added this to the 0.17.0 milestone May 10, 2015

nickeubank force-pushed the merge_indicator branch from 354026a to 5f61765 Compare May 10, 2015 16:34

nickeubank force-pushed the merge_indicator branch from 5f61765 to 1284a3a Compare July 28, 2015 23:35

jreback reviewed Aug 15, 2015

jreback reviewed Aug 19, 2015

nickeubank force-pushed the merge_indicator branch from 55118ba to 0938647 Compare August 20, 2015 18:19

jreback reviewed Aug 20, 2015

nickeubank force-pushed the merge_indicator branch from 0938647 to 6eada5e Compare August 20, 2015 18:41

nickeubank force-pushed the merge_indicator branch 2 times, most recently from 2c7e187 to 59e3f4d Compare August 20, 2015 19:09

jreback reviewed Aug 20, 2015

nickeubank force-pushed the merge_indicator branch from 59e3f4d to 6ac6532 Compare August 20, 2015 19:45

jreback reviewed Aug 20, 2015

nickeubank force-pushed the merge_indicator branch from 6ac6532 to e780bb5 Compare August 21, 2015 01:42

Create indicator for obs from left, right, or both

addef51

nickeubank force-pushed the merge_indicator branch from e780bb5 to addef51 Compare September 1, 2015 15:46

jreback added a commit that referenced this pull request Sep 4, 2015

Merge pull request #10054 from nickeubank/merge_indicator

4b9606b

ENH: Create merge indicator for obs from left, right, or both

jreback merged commit 4b9606b into pandas-dev:master Sep 4, 2015

chris-b1 mentioned this pull request Sep 20, 2015

ENH: add merge indicator to DataFrame.merge #11154

Merged
