-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Option to keep left/right join columns (or add _merge column) to merge() and concat() #7412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
pls specify an example of the input and output using simple copy-pastable code, much easier for people to play with / get solns. |
Sure thing: from pandas import DataFrame, merge
import numpy as np
df = DataFrame(np.random.randn(10, 2), columns=["id", "sex"])
df2 = DataFrame(np.random.randn(10, 2), columns=["user_id", "name"])
df.id = range(10)
df2.user_id = range(3,13)
merge(df, df2, left_on="id", right_on="user_id", how="outer")
""" Returns:
id sex user_id name
0 0 -0.254309 NaN NaN
1 1 -0.363123 NaN NaN
2 2 -0.408873 NaN NaN
3 3 -1.209845 3 0.578440
4 4 0.952290 4 -1.336396
5 5 -0.091704 5 0.255794
6 6 0.984578 6 -0.469222
7 7 -0.694126 7 1.197256
8 8 0.369942 8 -0.656366
9 9 1.544090 9 -0.975548
10 NaN NaN 10 -1.827958
11 NaN NaN 11 -1.523407
12 NaN NaN 12 -0.785032
"""
# But when merging on index / same column, you can't do this because the merge column gets joined into one and you lose all merge source information
merge(df.set_index("id"), df2.set_index("user_id"), left_index=True, right_index=True, how="outer")
"""
sex name
0 -0.254309 NaN
1 -0.363123 NaN
2 -0.408873 NaN
3 -1.209845 0.578440
4 0.952290 -1.336396
5 -0.091704 0.255794
6 0.984578 -0.469222
7 -0.694126 1.197256
8 0.369942 -0.656366
9 1.544090 -0.975548
10 NaN -1.827958
11 NaN -1.523407
12 NaN -0.785032
"""
# What you'd want after setting merge_info=True
"""
sex name _merge
0 -0.254309 NaN 1
1 -0.363123 NaN 1
2 -0.408873 NaN 1
3 -1.209845 0.578440 3
4 0.952290 -1.336396 3
5 -0.091704 0.255794 3
6 0.984578 -0.469222 3
7 -0.694126 1.197256 3
8 0.369942 -0.656366 3
9 1.544090 -0.975548 3
10 NaN -1.827958 2
11 NaN -1.523407 2
12 NaN -0.785032 2
""" |
work for you?
|
@jreback : Is there an easy way to get the _merge column in @makmanalp's comment? The NaNs could indicate _merge results, but they are ambiguous - they could have existed in the Series already. |
@makmanalp: I've made a project to execute Stata-like commands with Pandas. This allows for Stata-like merges that produce a merge variable: |
@jreback sorry to leave this thread hanging, looks like I'd missed your response. I think that does what I need for the most part. It does seem like a bit of a hack though. @shafiquejamal Thanks, this is pretty neat. It'll come in handy when converting stata folks over. |
@kobejohn not sure what you are asking about. can you show an example. I don't see anything being dropped. |
Ok my apologies - I misinterpreted both my results and this thread. I'll delete the comment and then get more sleep. |
np I think this issue was more about having an indicator where the merged column came FROM |
Hello!
I just heard from a colleague that they're looking for the analogue of STATA's merge command (http://www.stata.com/help.cgi?merge) which generates a
_merge
column that includes a code which specifies in an outer join whether the row existed in the right table, the left table or both. I know you can hack your way around this by doing set operations on the join columns / indices or creating new columns, but there could be an argument for having this be included functionality if it could be done simultaneously during the merge or just for sheer convenience.The use case specified was that after they merged, they were checking over the data to find inconsistencies and rows that should have been merged but somehow didn't.
Let me know if there would be any interest in this, and I could maybe have a first shot at implementing it.
The text was updated successfully, but these errors were encountered: