Skip to content

Confusing interpretation of what DataFrame.join does #12188

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
edublancas opened this issue Jan 31, 2016 · 4 comments
Closed

Confusing interpretation of what DataFrame.join does #12188

edublancas opened this issue Jan 31, 2016 · 4 comments
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@edublancas
Copy link
Contributor

From the docs I see that the difference between .merge and .join is that .join operates using indexes by default, but it also lets you use columns, so I tried to use it for that since it sounds natural when coming from the SQL world.

From the docs:

on : column name, tuple/list of column names, or array-like
Column(s) to use for joining, otherwise join on index. If multiples columns given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation

From my understanding, if on is absent a join operation is performed on the index, if on is present, it would be reasonable to think that the same operation would be performed.

Having said that:

a = pd.DataFrame({'address' : ["1820 SOME STREET", "32 ANOTHER STREET",
                              "1140 BIG STREET", "20 SMALL STREET",
                              "1090 AVENUE NAME"],
                  'id' : [1,2,3,4,5]
                  })
b = pd.DataFrame({'address' : ["1090 AVENUE NAME",
                              "1140 BIG STREET", "20 SMALL STREET",
                              "1820 SOME STREET", "32 ANOTHER STREET"],
                  'id' : [5,3,4,1,2]})

a.join(b, on='id', lsuffix='_x',  rsuffix='_y', how='left')

Output:

           address_x  id_x          address_y  id_y
0   1820 SOME STREET     1    1140 BIG STREET     3
1  32 ANOTHER STREET     2    20 SMALL STREET     4
2    1140 BIG STREET     3   1820 SOME STREET     1
3    20 SMALL STREET     4  32 ANOTHER STREET     2
4   1090 AVENUE NAME     5                NaN   NaN

And if I add id as index:

a.set_index('id', inplace=True)
b.set_index('id', inplace=True)
a.join(b, lsuffix='_x',  rsuffix='_y', how='left')

Output:

            address_x          address_y
id
1    1820 SOME STREET   1820 SOME STREET
2   32 ANOTHER STREET  32 ANOTHER STREET
3     1140 BIG STREET    1140 BIG STREET
4     20 SMALL STREET    20 SMALL STREET
5    1090 AVENUE NAME   1090 AVENUE NAME

Is that the correct behavior? If yes, I think the documentation is misleading. It took me a lot to find the bug in my code and I ended up using merge since .join works in an unexpected way.

I don't think I'm the only one with this issue, so maybe a change in the documentation would help to clarify how .join works.

@jreback
Copy link
Contributor

jreback commented Jan 31, 2016

xref to #6336

@jreback
Copy link
Contributor

jreback commented Jan 31, 2016

So the above join operation does this

In [3]: a.join(b, on='id', lsuffix='_x',  rsuffix='_y', how='left')
Out[3]: 
           address_x  id_x          address_y  id_y
0   1820 SOME STREET     1    1140 BIG STREET     3
1  32 ANOTHER STREET     2    20 SMALL STREET     4
2    1140 BIG STREET     3   1820 SOME STREET     1
3    20 SMALL STREET     4  32 ANOTHER STREET     2
4   1090 AVENUE NAME     5                NaN   NaN

this merge

In [4]: pd.merge(a, b, left_on='id', right_index=True, suffixes=('_x','_y'), how='left')
Out[4]: 
           address_x  id_x          address_y  id_y
0   1820 SOME STREET     1    1140 BIG STREET     3
1  32 ANOTHER STREET     2    20 SMALL STREET     4
2    1140 BIG STREET     3   1820 SOME STREET     1
3    20 SMALL STREET     4  32 ANOTHER STREET     2
4   1090 AVENUE NAME     5                NaN   NaN

Whereas I think you think it should do this one

In [5]: pd.merge(a, b, left_on='id', right_on='id', suffixes=('_x','_y'), how='left')
Out[5]: 
           address_x  id          address_y
0   1820 SOME STREET   1   1820 SOME STREET
1  32 ANOTHER STREET   2  32 ANOTHER STREET
2    1140 BIG STREET   3    1140 BIG STREET
3    20 SMALL STREET   4    20 SMALL STREET
4   1090 AVENUE NAME   5   1090 AVENUE NAME

I agree that is a bit confusing a bit, but I think the rationale is that you are joining from left, and the on controls what you are joining (on the left). There is no way in the .join to control the rhs, which is pre-supposed to be the index.

So this is what

In [9]: a.join(b.set_index('id'), on='id', lsuffix='_x',  rsuffix='_y', how='left')
Out[9]: 
           address_x  id          address_y
0   1820 SOME STREET   1   1820 SOME STREET
1  32 ANOTHER STREET   2  32 ANOTHER STREET
2    1140 BIG STREET   3    1140 BIG STREET
3    20 SMALL STREET   4    20 SMALL STREET
4   1090 AVENUE NAME   5   1090 AVENUE NAME

So I think a doc-update, maybe with some more examples would help?

@jreback jreback added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 31, 2016
@jreback jreback added this to the Next Major Release milestone Jan 31, 2016
@edublancas
Copy link
Contributor Author

Thanks for the clarification! I agree that a documentation update will be really helpful.

Besides an update in the join method doc, a brief mention of this behavior would be useful at the bottom of Database-style DataFrame joining/merging section here.

@jreback
Copy link
Contributor

jreback commented Jan 31, 2016

sure - pull requests would be welcome!

@jreback jreback modified the milestones: 0.18.0, Next Major Release Feb 1, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.0 Feb 11, 2016
@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants