-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Adding keep parameter to merge method #31332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks @dcsaba89 - TBH this sounds like a good idea to me. Is it something you're interested in working on? @pandas-dev/pandas-core is this something pandas would consider accepting? |
this is not backward compatible at all (at with keep=True) you are doing a 1-many merge that is what you asked for; so now you want a 1-1 merge which is not the same i think we have an old issue about this where we had a parameter that would validate the type of merge (1-1, 1-many etc) and validate if that matches your data pls see if u can find and link this |
yes, #27430 |
thanks for finding @MarcoGorelli i like the idea of an explicit validation parameter on the type of merge; -0 in adding a keep parameter (default would have to be None) |
This is exactly backward compatible. if we would have a new keep parameter with default keep=True, the old pd.merge(df1, df2, how="left", on="A") would mean pd.merge(df1, df2, how="left", on="A", keep=True) which is a 1-many again, because you keep all the matches based on the given conditions. but you could have the option to say: avoid duplicating my rows in df1 when there are multiple matches in df2. |
In terms of merge validation, there exists the But it seems like there's not much further interest on adding a |
Consider the following example:
import pandas as pd
x = pd.DataFrame({'A': [1, 2, 3], 'X': [11, 12, 13]})
y = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'Y': [111, 111, 222, 222, 333, 333]})
We want to extend x with column 'Y' from y, based on 'A'.
x1 = pd.merge(x, y, how='left', on='A')
When we check the result, we see new rows coming from y when a key in 'A' occurs multiple times in y.
It would be great to do is this way:
x1 = pd.merge(x, y, how='left', on='A', keep='first')
However, unfortunately, merge does not have this keep parameter.
So, we have to figure out a workaround:
Before we could merge x and y, duplicates based on 'A' from y must be removed.
x1 = pd.merge(x, y.drop_duplicates(subset='A', keep='first'), on='A')
This issue generates lots of suffering for many people, who are confused because merging results in increasing number of rows.
https://stackoverflow.com/questions/37095161/number-of-rows-changes-even-after-pandas-merge-with-left-option
By implementing a feature of adding 'keep' parameter to merge, we can easily avoid increasing number of rows in our initial dataframe.
Additional notes:
As drop_duplicates has this keep parameter, it is easy to make it backward compatible by adding 'keep' (True by default) parameter to merge. In this case it should have the same result as earlier,
pd.merge(x, y, how='left', on='A', keep=True) == pd.merge(x, y, how='left', on='A')
Based on my ideas, keep='first' (example: pd.merge(x, y, how='left', on='A', keep='first')) would result the same as pd.merge(x, y.drop_duplicates(subset='A', keep='first'), on='A') currently, (where subset parameter of drop_duplicates is the same as on parameter of merge.
The text was updated successfully, but these errors were encountered: