Skip to content

Adding keep parameter to merge method #31332

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dcsaba89 opened this issue Jan 26, 2020 · 6 comments
Closed

Adding keep parameter to merge method #31332

dcsaba89 opened this issue Jan 26, 2020 · 6 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@dcsaba89
Copy link

Consider the following example:
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 3], 'X': [11, 12, 13]})
y = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'Y': [111, 111, 222, 222, 333, 333]})

We want to extend x with column 'Y' from y, based on 'A'.

x1 = pd.merge(x, y, how='left', on='A')

When we check the result, we see new rows coming from y when a key in 'A' occurs multiple times in y.

It would be great to do is this way:

x1 = pd.merge(x, y, how='left', on='A', keep='first')

However, unfortunately, merge does not have this keep parameter.
So, we have to figure out a workaround:

Before we could merge x and y, duplicates based on 'A' from y must be removed.

x1 = pd.merge(x, y.drop_duplicates(subset='A', keep='first'), on='A')

This issue generates lots of suffering for many people, who are confused because merging results in increasing number of rows.

https://stackoverflow.com/questions/37095161/number-of-rows-changes-even-after-pandas-merge-with-left-option

By implementing a feature of adding 'keep' parameter to merge, we can easily avoid increasing number of rows in our initial dataframe.

Additional notes:

As drop_duplicates has this keep parameter, it is easy to make it backward compatible by adding 'keep' (True by default) parameter to merge. In this case it should have the same result as earlier,

pd.merge(x, y, how='left', on='A', keep=True) == pd.merge(x, y, how='left', on='A')

Based on my ideas, keep='first' (example: pd.merge(x, y, how='left', on='A', keep='first')) would result the same as pd.merge(x, y.drop_duplicates(subset='A', keep='first'), on='A') currently, (where subset parameter of drop_duplicates is the same as on parameter of merge.

@MarcoGorelli MarcoGorelli added Enhancement Needs Discussion Requires discussion from core team before further action labels Feb 13, 2020
@MarcoGorelli
Copy link
Member

Thanks @dcsaba89 - TBH this sounds like a good idea to me. Is it something you're interested in working on?

@pandas-dev/pandas-core is this something pandas would consider accepting?

@jreback
Copy link
Contributor

jreback commented Feb 13, 2020

this is not backward compatible at all (at with keep=True)

you are doing a 1-many merge that is what you asked for; so now you want a 1-1 merge which is not the same

i think we have an old issue about this where we had a parameter that would validate the type of merge (1-1, 1-many etc) and validate if that matches your data

pls see if u can find and link this

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 13, 2020

i think we have an old issue about this where we had a parameter that would validate the type of merge (1-1, 1-many etc) and validate if that matches your data

yes, #27430

@jreback
Copy link
Contributor

jreback commented Feb 13, 2020

thanks for finding @MarcoGorelli

i like the idea of an explicit validation parameter on the type of merge; -0 in adding a keep parameter (default would have to be None)

@dcsaba89
Copy link
Author

dcsaba89 commented Mar 1, 2020

this is not backward compatible at all (at with keep=True)

you are doing a 1-many merge that is what you asked for; so now you want a 1-1 merge which is not the same

i think we have an old issue about this where we had a parameter that would validate the type of merge (1-1, 1-many etc) and validate if that matches your data

pls see if u can find and link this

This is exactly backward compatible.
you are right, now when you say pd.merge(df1, df2, how="left", on="A") you are doing a 1-many merge

if we would have a new keep parameter with default keep=True, the old pd.merge(df1, df2, how="left", on="A") would mean pd.merge(df1, df2, how="left", on="A", keep=True) which is a 1-many again, because you keep all the matches based on the given conditions.

but you could have the option to say: avoid duplicating my rows in df1 when there are multiple matches in df2.

@mroeschke
Copy link
Member

In terms of merge validation, there exists the validate parameter.

But it seems like there's not much further interest on adding a keep parameter along these lines among the other core devs (I think this would add some significant complexity to the merging logic). Closing due to lack of activity and interest, but happy to reopen if interest is revived

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants