Skip to content

Option to keep left/right join columns (or add _merge column) to merge() and concat() #7412

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
makmanalp opened this issue Jun 9, 2014 · 9 comments
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@makmanalp
Copy link
Contributor

Hello!

I just heard from a colleague that they're looking for the analogue of STATA's merge command (http://www.stata.com/help.cgi?merge) which generates a _merge column that includes a code which specifies in an outer join whether the row existed in the right table, the left table or both. I know you can hack your way around this by doing set operations on the join columns / indices or creating new columns, but there could be an argument for having this be included functionality if it could be done simultaneously during the merge or just for sheer convenience.

The use case specified was that after they merged, they were checking over the data to find inconsistencies and rows that should have been merged but somehow didn't.

Let me know if there would be any interest in this, and I could maybe have a first shot at implementing it.

@makmanalp makmanalp changed the title Add option to keep left / right join columns on merge() and concat() Option to keep left/right join columns (or add _merge column) to merge() and concat() Jun 9, 2014
@jreback
Copy link
Contributor

jreback commented Jun 9, 2014

pls specify an example of the input and output using simple copy-pastable code, much easier for people to play with / get solns.

@makmanalp
Copy link
Contributor Author

Sure thing:

from pandas import DataFrame, merge
import numpy as np

df = DataFrame(np.random.randn(10, 2), columns=["id", "sex"])
df2 = DataFrame(np.random.randn(10, 2), columns=["user_id", "name"])
df.id = range(10)
df2.user_id = range(3,13)

merge(df, df2, left_on="id", right_on="user_id",  how="outer")
""" Returns:
    id       sex  user_id      name
0    0 -0.254309      NaN       NaN
1    1 -0.363123      NaN       NaN
2    2 -0.408873      NaN       NaN
3    3 -1.209845        3  0.578440
4    4  0.952290        4 -1.336396
5    5 -0.091704        5  0.255794
6    6  0.984578        6 -0.469222
7    7 -0.694126        7  1.197256
8    8  0.369942        8 -0.656366
9    9  1.544090        9 -0.975548
10 NaN       NaN       10 -1.827958
11 NaN       NaN       11 -1.523407
12 NaN       NaN       12 -0.785032
"""

# But when merging on index / same column, you can't do this because the merge column gets joined into one and you lose all merge source information

merge(df.set_index("id"), df2.set_index("user_id"), left_index=True, right_index=True,  how="outer")
"""
         sex      name
0  -0.254309       NaN
1  -0.363123       NaN
2  -0.408873       NaN
3  -1.209845  0.578440
4   0.952290 -1.336396
5  -0.091704  0.255794
6   0.984578 -0.469222
7  -0.694126  1.197256
8   0.369942 -0.656366
9   1.544090 -0.975548
10       NaN -1.827958
11       NaN -1.523407
12       NaN -0.785032
"""

# What you'd want after setting merge_info=True

"""
         sex      name    _merge
0  -0.254309       NaN         1
1  -0.363123       NaN         1
2  -0.408873       NaN         1
3  -1.209845  0.578440         3
4   0.952290 -1.336396         3
5  -0.091704  0.255794         3
6   0.984578 -0.469222         3
7  -0.694126  1.197256         3
8   0.369942 -0.656366         3
9   1.544090 -0.975548         3
10       NaN -1.827958         2
11       NaN -1.523407         2
12       NaN -0.785032         2
"""

@jreback
Copy link
Contributor

jreback commented Jun 9, 2014

work for you?

In [14]: pd.merge(df.set_index("id",drop=False), df2.set_index("user_id",drop=False), left_index=True, right_index=True,  how="outer")
Out[14]: 
    id       sex  user_id      name
0    0  1.365492      NaN       NaN
1    1 -0.598057      NaN       NaN
2    2 -1.092018      NaN       NaN
3    3 -1.059410        3 -0.786692
4    4 -0.110475        4 -0.303009
5    5  0.792464        5  0.150692
6    6 -1.744959        6  2.088291
7    7  1.169675        7  0.911539
8    8 -1.835623        8  0.503609
9    9 -0.037064        9  1.105057
10 NaN       NaN       10 -0.342427
11 NaN       NaN       11  0.158631
12 NaN       NaN       12  1.780248

@shafiquejamal
Copy link

@jreback : Is there an easy way to get the _merge column in @makmanalp's comment? The NaNs could indicate _merge results, but they are ambiguous - they could have existed in the Series already.

@shafiquejamal
Copy link

@makmanalp: I've made a project to execute Stata-like commands with Pandas. This allows for Stata-like merges that produce a merge variable:

https://github.com/shafiquejamal/easyframes

@makmanalp
Copy link
Contributor Author

@jreback sorry to leave this thread hanging, looks like I'd missed your response. I think that does what I need for the most part. It does seem like a bit of a hack though.

@shafiquejamal Thanks, this is pretty neat. It'll come in handy when converting stata folks over.

@jreback
Copy link
Contributor

jreback commented Mar 24, 2016

@kobejohn not sure what you are asking about. can you show an example. I don't see anything being dropped.

@kobejohn
Copy link

Ok my apologies - I misinterpreted both my results and this thread. I'll delete the comment and then get more sleep.

@jreback
Copy link
Contributor

jreback commented Mar 24, 2016

np I think this issue was more about having an indicator where the merged column came FROM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

5 participants