-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Create merge indicator for obs from left, right, or both #10054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8faf51d
to
76e4f33
Compare
|
||
|
||
def _indicator_post_merge(self, result, left, right): | ||
result['_left_indicator'].fillna(0, inplace=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use inplace
. It makes the code much harder to read and is not used anywhere within the pandas codebase (well it shouldn't be)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Ok will update.
@jreback : for my own purposes: I thought inplace was preferable for large
datasets because it prevents unnecessary memory duplication. Am I wrong
that:
s = s.fillna(0)
Causes duplication of s in memory (at least temporarily) while:
s.fillna(0, inplace=True)
Does not?
On Mon, May 4, 2015 at 3:37 AM jreback [email protected] wrote:
In pandas/tools/merge.py
#10054 (comment):
columns = left.columns.values.tolist() + right.columns.values.tolist()
for i in ['_left_indicator', '_right_indicator', '_merge']:
if i in columns:
raise ValueError("Cannot use `indicator=True` option when data contains a column named {}".format(i))
left['_left_indicator'] = 1
left['_left_indicator'] = left['_left_indicator'].astype('int8')
right['_right_indicator'] = 2
right['_right_indicator'] = right['_right_indicator'].astype('int8')
- def _indicator_post_merge(self, result, left, right):
result['_left_indicator'].fillna(0, inplace=True)
don't use inplace. It makes the code much harder to read and is not used
anywhere within the pandas codebase (well it shouldn't be)—
Reply to this email directly or view it on GitHub
https://github.com/pydata/pandas/pull/10054/files#r29576706.
@jreback : I've pushed a version that does not use The difficulty is the following: at the moment I'm adding indicator columns to the This can be done in one of two ways: (1) modify the input DataFrames by adding new column to those objects and then deleting those columns after the merge (as I do in commit 76e4f33), or (2) duplicate those objects and then add the columns to the copies of the original DataFrames (as in this commit, cf8412f). (1) seems preferable to me since it does not entail the duplication of both input DataFrames in memory. But I think (1) requires With
But if I do:
then that's not actually changing the input DataFrame, just creating a new DataFrame and pointing the variable So I think the question is: can we make an exception to the "no use of |
@nickeubank The semantics of almost all pandas methods (e.g. everything except in-place indexing / inplace kw), are NOT to touch the passed in data in any way. If you want an indicator for the merged data, I dont' see a problem with simply copying the input data, then do what you want. The marginal perf benefit is really small. We prefer correctness over perf anyday. |
OK, thanks for clarifying! Go to know both that that's how it works, and your preference ordering. :) |
@jreback One other question on priorities: I thought of a way to do this that would be slower than the version I've pushed, but would require less memory. Any rules of thumb for balancing those interests? Currently: Copy input data; add indicator columns; feed through merge; create an indicator of where each column came from based on those indicator columns. Alternative: Merge input data as normal; pull out merging columns from input data and merge them into the output, use those merges to create indicators for where each row came from. |
just copy the data much simpler |
Added docs. Anything else? |
you can move this to the 0.17.0 docs |
354026a
to
5f61765
Compare
@nickeubank can you rebase this and i'll take another look |
5f61765
to
1284a3a
Compare
@jreback rebased! |
@@ -864,6 +864,54 @@ def test_overlapping_columns_error_message(self): | |||
df2.columns = ['key1', 'foo', 'foo'] | |||
self.assertRaises(ValueError, merge, df, df2) | |||
|
|||
def test_indicator(self): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number here
@jreback added suggested tweak to allow users to pass strings to |
@@ -115,6 +115,15 @@ | |||
side, respectively | |||
copy : boolean, default True | |||
If False, do not copy data unnecessarily | |||
indicator : boolean or string, default False | |||
If True, adds a column to output DataFrame called "_merge" with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded directive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check (I think that's formatted correctly?)
55118ba
to
0938647
Compare
@jreback sorry about scrubbing confusion. Think better now. |
@@ -523,6 +523,12 @@ Here's a description of what each argument is for: | |||
cases but may improve performance / memory usage. The cases where copying | |||
can be avoided are somewhat pathological but this option is provided | |||
nonetheless. | |||
- ``indicator``: Add a column to the output DataFrame called ``_merge`` | |||
with information on the source of each row. ``_merge`` is Categorical-type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a version added here as well
ok, code looks good. I like what you added to |
0938647
to
6eada5e
Compare
@jreback ok, updated. One note: not quite sure how best to format the versionadded advisory in the list of options for |
look at how changes in |
2c7e187
to
59e3f4d
Compare
ok -- now a table! |
Observation Origin ``_merge`` value | ||
=================================== ================ | ||
Merge key only in ``'left'`` frame ``left_only`` | ||
Merge key only in ``'right'`` frame ``left_only`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right_only
can you add some tests with multiple merge keys (e.g. a list for |
59e3f4d
to
6ac6532
Compare
@jreback added. Sidenote: I hadn't noticed this before, but merge type-coerces key columns from
|
@nickeubank this is a byproduct of that a-prior you don't know if you will have to put If you'd like to make another issue, go ahead. (though look thru existing ones as it might already be there). and of course PR's are welcome. |
@@ -33,6 +33,18 @@ Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsne | |||
New features | |||
~~~~~~~~~~~~ | |||
|
|||
- ``merge`` now accepts the argument ``indicator`` which adds a Categorical-type column to the output `DataFrame` that takes on a value of ``left_only`` for observations whose merge key only appears in ``'left'`` DataFrame, ``right_only`` for observations whose merge key only appears in ``'right'`` DataFrame, and ``both`` if the observation's merge key is found in both. For more see updated :ref:`docs <merging.indicator>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can put that mini-table here as well
6ac6532
to
e780bb5
Compare
@jreback table added! thanks for note about coercion. Doesn't seem like highest priority, but I'll keep it in mind for a place to contribute later! |
@jreback Anything else you'd like me to amend on this? |
wanted @jorisvandenbossche to have a look |
@nickeubank pls rebase @TomAugspurger @jorisvandenbossche any comments? |
e780bb5
to
addef51
Compare
@jreback rebased |
@shoyer quick look pls. |
This looks pretty clean to me. Nice work @nickeubank! |
ENH: Create merge indicator for obs from left, right, or both
thanks! |
xref #7412
closes #8790
Adds a new column (
_merge
) to DataFrames being merged that takes on a value of1
if observation was only in left df,2
if only in right df, and3
if in both. Designed to address #7412 and #8790.Update: new column now categorical with more informative labels of
left_only
,right_only
, andboth
.Still at draft stage, but want to get comments before trying to refine too much.