Adding keep parameter to merge method #31332

dcsaba89 · 2020-01-26T23:17:32Z

Consider the following example:
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 3], 'X': [11, 12, 13]})
y = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'Y': [111, 111, 222, 222, 333, 333]})

We want to extend x with column 'Y' from y, based on 'A'.

x1 = pd.merge(x, y, how='left', on='A')

When we check the result, we see new rows coming from y when a key in 'A' occurs multiple times in y.
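To make the row growth concrete, here is a minimal sketch that runs the example above and compares row counts. It uses only the x and y frames already defined in this post.

```python
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 3], 'X': [11, 12, 13]})
y = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'Y': [111, 111, 222, 222, 333, 333]})

x1 = pd.merge(x, y, how='left', on='A')

# Each key in x matches two rows in y, so every row of x appears twice.
print(len(x), len(x1))  # 3 6
print(x1)
```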

It would be great to do it this way:

x1 = pd.merge(x, y, how='left', on='A', keep='first')

Unfortunately, merge does not have this keep parameter, so we have to use a workaround: before merging x and y, the rows of y that are duplicated on 'A' must be removed.

x1 = pd.merge(x, y.drop_duplicates(subset='A', keep='first'), how='left', on='A')
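A small sketch of the workaround, using slightly different illustrative data (an assumption, not the frames above): x gets an extra key 4 that is missing from y, and the Y values differ so it is visible which duplicate is kept. It also shows why how='left' should be passed explicitly. pd.merge defaults to an inner join, which would silently drop the unmatched row of x.

```python
import pandas as pd

# Illustrative data: key 4 in x has no match in y.
x = pd.DataFrame({'A': [1, 2, 3, 4], 'X': [11, 12, 13, 14]})
y = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'Y': [111, 112, 221, 222, 331, 332]})

# Keep only the first row of y for each value of 'A'.
y_first = y.drop_duplicates(subset='A', keep='first')

# Left join: one output row per row of x, Y is NaN where there is no match.
left = pd.merge(x, y_first, how='left', on='A')

# Default (inner) join: the row with A == 4 disappears.
inner = pd.merge(x, y_first, on='A')

print(len(x), len(left), len(inner))
```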

This behaviour confuses many people, who do not expect a left merge to increase the number of rows:

https://stackoverflow.com/questions/37095161/number-of-rows-changes-even-after-pandas-merge-with-left-option

Adding a 'keep' parameter to merge would make it easy to avoid increasing the number of rows of the initial dataframe.

Additional notes:

drop_duplicates already has this keep parameter. Adding 'keep' to merge with a default of True would be backward compatible, because the default should give the same result as today:

pd.merge(x, y, how='left', on='A', keep=True) == pd.merge(x, y, how='left', on='A')

As I imagine it, keep='first' (example: pd.merge(x, y, how='left', on='A', keep='first')) would give the same result as pd.merge(x, y.drop_duplicates(subset='A', keep='first'), how='left', on='A') does currently (where the subset parameter of drop_duplicates is the same as the on parameter of merge).
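The proposed semantics can be emulated today with a small wrapper. This is only a sketch: merge_keep is a hypothetical helper name, not a pandas function, and it assumes that keep=True means "keep every match" while any other value is passed straight through to drop_duplicates.

```python
import pandas as pd


def merge_keep(left, right, on, how='left', keep=True):
    """Hypothetical helper emulating the proposed keep parameter.

    keep=True keeps every match (plain pd.merge); any other value is
    forwarded to DataFrame.drop_duplicates on the right frame.
    """
    if keep is True:
        return pd.merge(left, right, how=how, on=on)
    deduped = right.drop_duplicates(subset=on, keep=keep)
    return pd.merge(left, deduped, how=how, on=on)


x = pd.DataFrame({'A': [1, 2, 3], 'X': [11, 12, 13]})
y = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'Y': [111, 111, 222, 222, 333, 333]})

same_as_today = merge_keep(x, y, on='A')              # 6 rows, as with pd.merge
one_row_per_key = merge_keep(x, y, on='A', keep='first')  # 3 rows
```

With keep=True the wrapper returns exactly what pd.merge returns, which is the backward-compatibility argument made above.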

MarcoGorelli · 2020-02-13T08:50:26Z

Thanks @dcsaba89 - TBH this sounds like a good idea to me. Is it something you're interested in working on?

@pandas-dev/pandas-core is this something pandas would consider accepting?

jreback · 2020-02-13T11:30:35Z

this is not backward compatible at all (at least with keep=True)

you are doing a 1-many merge that is what you asked for; so now you want a 1-1 merge which is not the same

i think we have an old issue about this where we had a parameter that would validate the type of merge (1-1, 1-many etc) and validate if that matches your data

pls see if u can find and link this

MarcoGorelli · 2020-02-13T12:08:17Z

> i think we have an old issue about this where we had a parameter that would validate the type of merge (1-1, 1-many etc) and validate if that matches your data

yes, #27430

jreback · 2020-02-13T12:22:23Z

thanks for finding @MarcoGorelli

i like the idea of an explicit validation parameter on the type of merge; -0 in adding a keep parameter (default would have to be None)

dcsaba89 · 2020-03-01T19:11:55Z

> this is not backward compatible at all (at least with keep=True)
>
> you are doing a 1-many merge that is what you asked for; so now you want a 1-1 merge which is not the same
>
> i think we have an old issue about this where we had a parameter that would validate the type of merge (1-1, 1-many etc) and validate if that matches your data
>
> pls see if u can find and link this

This is exactly backward compatible.
you are right, now when you say pd.merge(df1, df2, how="left", on="A") you are doing a 1-many merge

if we had a new keep parameter with default keep=True, the old pd.merge(df1, df2, how="left", on="A") would mean pd.merge(df1, df2, how="left", on="A", keep=True), which is again a 1-many merge, because you keep all the matches based on the given conditions.

but you could have the option to say: avoid duplicating my rows in df1 when there are multiple matches in df2.

mroeschke · 2021-07-27T05:25:45Z

In terms of merge validation, there exists the validate parameter.
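For reference, a minimal sketch of how validate behaves on the frames from the original post. It checks the shape of the merge and raises pandas.errors.MergeError on a mismatch. It does not drop or deduplicate rows, so it detects the situation rather than replacing the drop_duplicates workaround.

```python
import pandas as pd
from pandas.errors import MergeError

x = pd.DataFrame({'A': [1, 2, 3], 'X': [11, 12, 13]})
y = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'Y': [111, 111, 222, 222, 333, 333]})

rejected = False
try:
    # 'A' is not unique in y, so this is not a one-to-one merge.
    pd.merge(x, y, how='left', on='A', validate='one_to_one')
except MergeError as err:
    rejected = True
    print('merge rejected:', err)

# 'A' is unique in x, so a one-to-many merge validates and returns 6 rows.
x1 = pd.merge(x, y, how='left', on='A', validate='one_to_many')
```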

But it seems like there's not much further interest in adding a keep parameter along these lines among the other core devs (I think this would add some significant complexity to the merging logic). Closing due to lack of activity and interest, but happy to reopen if interest is revived.

MarcoGorelli added the Enhancement and Needs Discussion labels Feb 13, 2020

mroeschke closed this as completed Jul 27, 2021
