Skip to content

ENH: Create merge indicator for obs from left, right, or both #10054

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 4, 2015

Conversation

nickeubank
Copy link
Contributor

xref #7412
closes #8790

Adds a new column (_merge) to DataFrames being merged that takes on a value of 1 if observation was only in left df, 2 if only in right df, and 3 if in both. Designed to address #7412 and #8790.

Update: new column now categorical with more informative labels of left_only, right_only, and both.

Still at draft stage, but want to get comments before trying to refine too much.

@nickeubank nickeubank force-pushed the merge_indicator branch 7 times, most recently from 8faf51d to 76e4f33 Compare May 3, 2015 23:23


def _indicator_post_merge(self, result, left, right):
result['_left_indicator'].fillna(0, inplace=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use inplace. It makes the code much harder to read and is not used anywhere within the pandas codebase (well it shouldn't be)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Ok will update.

@jreback : for my own purposes: I thought inplace was preferable for large
datasets because it prevents unnecessary memory duplication. Am I wrong
that:

s = s.fillna(0)

Causes duplication of s in memory (at least temporarily) while:

s.fillna(0, inplace=True)

Does not?
On Mon, May 4, 2015 at 3:37 AM jreback [email protected] wrote:

In pandas/tools/merge.py
#10054 (comment):

  •    columns = left.columns.values.tolist() + right.columns.values.tolist()
    
  •    for i in ['_left_indicator', '_right_indicator', '_merge']:
    
  •        if i in columns:
    
  •            raise ValueError("Cannot use `indicator=True` option when data contains a column named {}".format(i))
    
  •    left['_left_indicator'] = 1
    
  •    left['_left_indicator'] = left['_left_indicator'].astype('int8')
    
  •    right['_right_indicator'] = 2
    
  •    right['_right_indicator'] = right['_right_indicator'].astype('int8')
    
  • def _indicator_post_merge(self, result, left, right):
  •    result['_left_indicator'].fillna(0, inplace=True)
    

don't use inplace. It makes the code much harder to read and is not used
anywhere within the pandas codebase (well it shouldn't be)


Reply to this email directly or view it on GitHub
https://github.com/pydata/pandas/pull/10054/files#r29576706.

@nickeubank
Copy link
Contributor Author

@jreback : I've pushed a version that does not use inplace, but it comes at a cost. Namely, I don't think I can avoid making a full duplicates of both input DataFrames if I don't use inplace=True for the drop() commands.

The difficulty is the following: at the moment I'm adding indicator columns to the left and right DataFrames (which I can do without inplace), passing the updated DataFrames into the concatenate_block_managers, then using those columns post-merge to identify the source of each row.

This can be done in one of two ways: (1) modify the input DataFrames by adding new column to those objects and then deleting those columns after the merge (as I do in commit 76e4f33), or (2) duplicate those objects and then add the columns to the copies of the original DataFrames (as in this commit, cf8412f).

(1) seems preferable to me since it does not entail the duplication of both input DataFrames in memory. But I think (1) requires inplace modification to remove those indicator columns.

With inplace=True, I can cleanup in _indicator_post_merge() as follows:

left.drop(label='_left_indicator', axis=1, inplace=True)

But if I do:

left = left.drop(label='_left_indicator', axis=1)

then that's not actually changing the input DataFrame, just creating a new DataFrame and pointing the variable left to it. But once the function finishes, since merge doesn't return new versions of the input DataFrames, those changes disappear, and the input DataFrames still have the extra columns.

So I think the question is: can we make an exception to the "no use of inplace rule" to allow this option to not fully duplicate both input DataFrames?

@jreback
Copy link
Contributor

jreback commented May 4, 2015

@nickeubank inplace ops, rarely are faster. They most often copy, modify the copy, the make it so the original variable points to the copy.

The semantics of almost all pandas methods (e.g. everything except in-place indexing / inplace kw), are NOT to touch the passed in data in any way.

If you want an indicator for the merged data, I dont' see a problem with simply copying the input data, then do what you want.

The marginal perf benefit is really small. We prefer correctness over perf anyday.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels May 4, 2015
@nickeubank
Copy link
Contributor Author

OK, thanks for clarifying! Go to know both that that's how it works, and your preference ordering. :)

@nickeubank
Copy link
Contributor Author

@jreback One other question on priorities: I thought of a way to do this that would be slower than the version I've pushed, but would require less memory. Any rules of thumb for balancing those interests?

Currently: Copy input data; add indicator columns; feed through merge; create an indicator of where each column came from based on those indicator columns.
-> One merge, but have to make copies of both input DataFrames.

Alternative: Merge input data as normal; pull out merging columns from input data and merge them into the output, use those merges to create indicators for where each row came from.
-> Three merges (initial, and one for each input), but no full copying of input DataFrames.

@jreback
Copy link
Contributor

jreback commented May 8, 2015

just copy the data much simpler

@nickeubank
Copy link
Contributor Author

Added docs. Anything else?

@jreback jreback added this to the 0.17.0 milestone May 10, 2015
@jreback
Copy link
Contributor

jreback commented May 10, 2015

you can move this to the 0.17.0 docs

@jreback
Copy link
Contributor

jreback commented Jul 28, 2015

@nickeubank can you rebase this and i'll take another look

@nickeubank
Copy link
Contributor Author

@jreback rebased!

@@ -864,6 +864,54 @@ def test_overlapping_columns_error_message(self):
df2.columns = ['key1', 'foo', 'foo']
self.assertRaises(ValueError, merge, df, df2)

def test_indicator(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number here

@nickeubank
Copy link
Contributor Author

@jreback added suggested tweak to allow users to pass strings to indicator as name for indicator column.

@@ -115,6 +115,15 @@
side, respectively
copy : boolean, default True
If False, do not copy data unnecessarily
indicator : boolean or string, default False
If True, adds a column to output DataFrame called "_merge" with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded directive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check (I think that's formatted correctly?)

@nickeubank
Copy link
Contributor Author

@jreback sorry about scrubbing confusion. Think better now.

@@ -523,6 +523,12 @@ Here's a description of what each argument is for:
cases but may improve performance / memory usage. The cases where copying
can be avoided are somewhat pathological but this option is provided
nonetheless.
- ``indicator``: Add a column to the output DataFrame called ``_merge``
with information on the source of each row. ``_merge`` is Categorical-type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a version added here as well

@jreback
Copy link
Contributor

jreback commented Aug 20, 2015

ok, code looks good. I like what you added to merging.rst, can you put in bullet points or a table-like for better readability (the the meaning of the returned values in the merge indicator)

@nickeubank
Copy link
Contributor Author

@jreback ok, updated.

One note: not quite sure how best to format the versionadded advisory in the list of options for merge -- made an attempt but don't love it. Let me know if there's a better way.

@jreback
Copy link
Contributor

jreback commented Aug 20, 2015

look at how changes in convert_objects is done in whatsnew/v0.17.0 (a table is created its pretty easy)

@nickeubank nickeubank force-pushed the merge_indicator branch 2 times, most recently from 2c7e187 to 59e3f4d Compare August 20, 2015 19:09
@nickeubank
Copy link
Contributor Author

ok -- now a table!

Observation Origin ``_merge`` value
=================================== ================
Merge key only in ``'left'`` frame ``left_only``
Merge key only in ``'right'`` frame ``left_only``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right_only

@jreback
Copy link
Contributor

jreback commented Aug 20, 2015

can you add some tests with multiple merge keys (e.g. a list for on). seems should still work.

@nickeubank
Copy link
Contributor Author

@jreback added.

Sidenote: I hadn't noticed this before, but merge type-coerces key columns from int to float if you do an outer merge, even though the output is all int. Is that a known behavior?:

In[30]:
df1 = pd.DataFrame({'col1':[0,1], 'colleft':['a','b']})
df2 = pd.DataFrame({'col1':[1,3], 'colright':['b','x']})
pd.merge(df1, df2, how='outer', on='col1').col1.dtypes
Out[30]: dtype('float64')

In[31]:
pd.merge(df1, df2, how='inner', on='col1').col1.dtypes
Out[31]: dtype('int64')    

@jreback
Copy link
Contributor

jreback commented Aug 20, 2015

@nickeubank this is a byproduct of that a-prior you don't know if you will have to put NaN in columns, so you just coerce to a nan-capable type then go from there. I suppose you could in this case coerce back.

If you'd like to make another issue, go ahead. (though look thru existing ones as it might already be there). and of course PR's are welcome.

@@ -33,6 +33,18 @@ Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsne
New features
~~~~~~~~~~~~

- ``merge`` now accepts the argument ``indicator`` which adds a Categorical-type column to the output `DataFrame` that takes on a value of ``left_only`` for observations whose merge key only appears in ``'left'`` DataFrame, ``right_only`` for observations whose merge key only appears in ``'right'`` DataFrame, and ``both`` if the observation's merge key is found in both. For more see updated :ref:`docs <merging.indicator>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can put that mini-table here as well

@nickeubank
Copy link
Contributor Author

@jreback table added!

thanks for note about coercion. Doesn't seem like highest priority, but I'll keep it in mind for a place to contribute later!

@nickeubank
Copy link
Contributor Author

@jreback Anything else you'd like me to amend on this?

@jreback
Copy link
Contributor

jreback commented Aug 24, 2015

wanted @jorisvandenbossche to have a look

@jreback
Copy link
Contributor

jreback commented Sep 1, 2015

@nickeubank pls rebase

@TomAugspurger @jorisvandenbossche any comments?

@nickeubank
Copy link
Contributor Author

@jreback rebased

@jreback
Copy link
Contributor

jreback commented Sep 3, 2015

@shoyer quick look pls.

@shoyer
Copy link
Member

shoyer commented Sep 4, 2015

This looks pretty clean to me. Nice work @nickeubank!

jreback added a commit that referenced this pull request Sep 4, 2015
ENH: Create merge indicator for obs from left, right, or both
@jreback jreback merged commit 4b9606b into pandas-dev:master Sep 4, 2015
@jreback
Copy link
Contributor

jreback commented Sep 4, 2015

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: row-level Merge Status Variable
3 participants