Option to keep left/right join columns (or add _merge column) to merge() and concat() #7412

makmanalp · 2014-06-09T16:54:12Z

Hello!

I just heard from a colleague who is looking for an analogue of Stata's merge command (http://www.stata.com/help.cgi?merge), which generates a _merge column containing a code that says, for an outer join, whether each row existed in the left table, the right table, or both. I know you can hack your way around this by doing set operations on the join columns / indices or by creating new columns, but there is an argument for building this in, either because it could be computed during the merge itself or just for sheer convenience.

The use case was that, after merging, they check over the data to find inconsistencies and rows that should have been merged but somehow weren't.

Let me know if there would be any interest in this, and I could maybe have a first shot at implementing it.

jreback · 2014-06-09T16:56:03Z

Please give an example of the input and output as simple copy-pastable code; that makes it much easier for people to play with it and come up with solutions.

makmanalp · 2014-06-09T17:06:11Z

Sure thing:

from pandas import DataFrame, merge
import numpy as np

df = DataFrame(np.random.randn(10, 2), columns=["id", "sex"])
df2 = DataFrame(np.random.randn(10, 2), columns=["user_id", "name"])
df.id = range(10)
df2.user_id = range(3,13)

merge(df, df2, left_on="id", right_on="user_id", how="outer")
""" Returns:
    id       sex  user_id      name
0    0 -0.254309      NaN       NaN
1    1 -0.363123      NaN       NaN
2    2 -0.408873      NaN       NaN
3    3 -1.209845        3  0.578440
4    4  0.952290        4 -1.336396
5    5 -0.091704        5  0.255794
6    6  0.984578        6 -0.469222
7    7 -0.694126        7  1.197256
8    8  0.369942        8 -0.656366
9    9  1.544090        9 -0.975548
10 NaN       NaN       10 -1.827958
11 NaN       NaN       11 -1.523407
12 NaN       NaN       12 -0.785032
"""

# But when merging on the index / the same column you can't do this, because the merge columns get combined into one and you lose all information about where each row came from

merge(df.set_index("id"), df2.set_index("user_id"), left_index=True, right_index=True, how="outer")
"""
         sex      name
0  -0.254309       NaN
1  -0.363123       NaN
2  -0.408873       NaN
3  -1.209845  0.578440
4   0.952290 -1.336396
5  -0.091704  0.255794
6   0.984578 -0.469222
7  -0.694126  1.197256
8   0.369942 -0.656366
9   1.544090 -0.975548
10       NaN -1.827958
11       NaN -1.523407
12       NaN -0.785032
"""

# What you'd want after setting merge_info=True

"""
         sex      name    _merge
0  -0.254309       NaN         1
1  -0.363123       NaN         1
2  -0.408873       NaN         1
3  -1.209845  0.578440         3
4   0.952290 -1.336396         3
5  -0.091704  0.255794         3
6   0.984578 -0.469222         3
7  -0.694126  1.197256         3
8   0.369942 -0.656366         3
9   1.544090 -0.975548         3
10       NaN -1.827958         2
11       NaN -1.523407         2
12       NaN -0.785032         2
"""

jreback · 2014-06-09T17:45:51Z

work for you?

In [14]: pd.merge(df.set_index("id",drop=False), df2.set_index("user_id",drop=False), left_index=True, right_index=True,  how="outer")
Out[14]: 
    id       sex  user_id      name
0    0  1.365492      NaN       NaN
1    1 -0.598057      NaN       NaN
2    2 -1.092018      NaN       NaN
3    3 -1.059410        3 -0.786692
4    4 -0.110475        4 -0.303009
5    5  0.792464        5  0.150692
6    6 -1.744959        6  2.088291
7    7  1.169675        7  0.911539
8    8 -1.835623        8  0.503609
9    9 -0.037064        9  1.105057
10 NaN       NaN       10 -0.342427
11 NaN       NaN       11  0.158631
12 NaN       NaN       12  1.780248

shafiquejamal · 2014-07-22T20:53:35Z

@jreback : Is there an easy way to get the _merge column in @makmanalp's comment? The NaNs could indicate _merge results, but they are ambiguous - they could have existed in the Series already.
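One way to sidestep the NaN ambiguity is to tag each frame with its own marker column before merging, then derive the code from those markers. This is only a minimal sketch of the "creating new columns" workaround mentioned at the top of the thread, not a pandas feature: the helper column names `_in_left` / `_in_right` and the frame names `left` / `right` are made up for illustration, and the 1/2/3 codes follow the Stata-style output requested above.

```python
import numpy as np
import pandas as pd

# Same shape of data as the example earlier in the thread
left = pd.DataFrame({"id": range(10), "sex": np.random.randn(10)})
right = pd.DataFrame({"user_id": range(3, 13), "name": np.random.randn(10)})

# Hypothetical marker columns: every row of each input is tagged True
left_tagged = left.set_index("id").assign(_in_left=True)
right_tagged = right.set_index("user_id").assign(_in_right=True)

merged = pd.merge(left_tagged, right_tagged,
                  left_index=True, right_index=True, how="outer")

# A marker is NaN exactly when the row was absent from that side,
# regardless of any NaNs already present in the real data columns
in_left = merged.pop("_in_left").notna()
in_right = merged.pop("_in_right").notna()

# Stata-style codes: 1 = left only, 2 = right only, 3 = both
merged["_merge"] = in_left.astype(int) + 2 * in_right.astype(int)
print(merged)
```

Because the markers are removed with `pop`, the result has only the original data columns plus `_merge`.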

shafiquejamal · 2014-07-23T00:32:50Z

@makmanalp: I've made a project to execute Stata-like commands with Pandas. This allows for Stata-like merges that produce a merge variable:

https://github.com/shafiquejamal/easyframes

makmanalp · 2014-07-23T03:18:11Z

@jreback sorry to leave this thread hanging, looks like I'd missed your response. I think that does what I need for the most part. It does seem like a bit of a hack though.

@shafiquejamal Thanks, this is pretty neat. It'll come in handy when converting stata folks over.

jreback · 2016-03-24T15:44:37Z

@kobejohn not sure what you are asking about. can you show an example. I don't see anything being dropped.

kobejohn · 2016-03-24T18:26:36Z

Ok my apologies - I misinterpreted both my results and this thread. I'll delete the comment and then get more sleep.

jreback · 2016-03-24T18:45:15Z

np I think this issue was more about having an indicator where the merged column came FROM
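For readers landing here later: newer pandas releases accept an `indicator` argument on `merge()`, which adds a categorical `_merge` column with the values `left_only`, `right_only` and `both` instead of Stata's numeric codes. The sketch below assumes a pandas version that has this argument and reuses the shape of the example data from earlier in the thread; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": range(10), "sex": np.random.randn(10)})
df2 = pd.DataFrame({"user_id": range(3, 13), "name": np.random.randn(10)})

# indicator=True appends a "_merge" column recording each row's origin
merged = pd.merge(df.set_index("id"), df2.set_index("user_id"),
                  left_index=True, right_index=True,
                  how="outer", indicator=True)

# Rows that failed to match on either side are easy to pull out afterwards
unmatched = merged[merged["_merge"] != "both"]
counts = merged["_merge"].value_counts()
print(counts)
```

Filtering on `_merge != "both"` covers the original use case of finding rows that should have merged but did not.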

makmanalp changed the title ~~Add option to keep left / right join columns on merge() and concat()~~ Option to keep left/right join columns (or add _merge column) to merge() and concat() Jun 9, 2014

makmanalp closed this as completed Jul 23, 2014

jreback mentioned this issue Nov 11, 2014

Feature Request: row-level Merge Status Variable #8790

Closed

nickeubank mentioned this issue May 3, 2015

ENH: Create merge indicator for obs from left, right, or both #10054

Merged

jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and API Design labels May 10, 2015

jreback added this to the 0.17.0 milestone May 10, 2015

jorisvandenbossche modified the milestones: 0.17.0, 0.16.2 Jun 2, 2015
