Confusing interpretation of what DataFrame.join does #12188

edublancas · 2016-01-31T02:25:51Z

From the docs I see that the difference between .merge and .join is that .join operates using indexes by default, but it also lets you use columns, so I tried to use it for that since it sounds natural when coming from the SQL world.

From the docs:

on : column name, tuple/list of column names, or array-like
Column(s) to use for joining, otherwise join on index. If multiples columns given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation

From my understanding, if on is absent a join operation is performed on the index, if on is present, it would be reasonable to think that the same operation would be performed.

Having said that:

a = pd.DataFrame({'address' : ["1820 SOME STREET", "32 ANOTHER STREET",
                              "1140 BIG STREET", "20 SMALL STREET",
                              "1090 AVENUE NAME"],
                  'id' : [1,2,3,4,5]
                  })
b = pd.DataFrame({'address' : ["1090 AVENUE NAME",
                              "1140 BIG STREET", "20 SMALL STREET",
                              "1820 SOME STREET", "32 ANOTHER STREET"],
                  'id' : [5,3,4,1,2]})

a.join(b, on='id', lsuffix='_x',  rsuffix='_y', how='left')

Output:

           address_x  id_x          address_y  id_y
0   1820 SOME STREET     1    1140 BIG STREET     3
1  32 ANOTHER STREET     2    20 SMALL STREET     4
2    1140 BIG STREET     3   1820 SOME STREET     1
3    20 SMALL STREET     4  32 ANOTHER STREET     2
4   1090 AVENUE NAME     5                NaN   NaN

And if I add id as index:

a.set_index('id', inplace=True)
b.set_index('id', inplace=True)
a.join(b, lsuffix='_x',  rsuffix='_y', how='left')

Output:

            address_x          address_y
id
1    1820 SOME STREET   1820 SOME STREET
2   32 ANOTHER STREET  32 ANOTHER STREET
3     1140 BIG STREET    1140 BIG STREET
4     20 SMALL STREET    20 SMALL STREET
5    1090 AVENUE NAME   1090 AVENUE NAME

Is that the correct behavior? If yes, I think the documentation is misleading. It took me a lot to find the bug in my code and I ended up using merge since .join works in an unexpected way.

I don't think I'm the only one with this issue, so maybe a change in the documentation would help to clarify how .join works.

The text was updated successfully, but these errors were encountered:

jreback · 2016-01-31T14:55:56Z

xref to #6336

jreback · 2016-01-31T15:10:51Z

So the above join operation does this

In [3]: a.join(b, on='id', lsuffix='_x',  rsuffix='_y', how='left')
Out[3]: 
           address_x  id_x          address_y  id_y
0   1820 SOME STREET     1    1140 BIG STREET     3
1  32 ANOTHER STREET     2    20 SMALL STREET     4
2    1140 BIG STREET     3   1820 SOME STREET     1
3    20 SMALL STREET     4  32 ANOTHER STREET     2
4   1090 AVENUE NAME     5                NaN   NaN

this merge

In [4]: pd.merge(a, b, left_on='id', right_index=True, suffixes=('_x','_y'), how='left')
Out[4]: 
           address_x  id_x          address_y  id_y
0   1820 SOME STREET     1    1140 BIG STREET     3
1  32 ANOTHER STREET     2    20 SMALL STREET     4
2    1140 BIG STREET     3   1820 SOME STREET     1
3    20 SMALL STREET     4  32 ANOTHER STREET     2
4   1090 AVENUE NAME     5                NaN   NaN

Whereas I think you think it should do this one

In [5]: pd.merge(a, b, left_on='id', right_on='id', suffixes=('_x','_y'), how='left')
Out[5]: 
           address_x  id          address_y
0   1820 SOME STREET   1   1820 SOME STREET
1  32 ANOTHER STREET   2  32 ANOTHER STREET
2    1140 BIG STREET   3    1140 BIG STREET
3    20 SMALL STREET   4    20 SMALL STREET
4   1090 AVENUE NAME   5   1090 AVENUE NAME

I agree that is a bit confusing a bit, but I think the rationale is that you are joining from left, and the on controls what you are joining (on the left). There is no way in the .join to control the rhs, which is pre-supposed to be the index.

So this is what

In [9]: a.join(b.set_index('id'), on='id', lsuffix='_x',  rsuffix='_y', how='left')
Out[9]: 
           address_x  id          address_y
0   1820 SOME STREET   1   1820 SOME STREET
1  32 ANOTHER STREET   2  32 ANOTHER STREET
2    1140 BIG STREET   3    1140 BIG STREET
3    20 SMALL STREET   4    20 SMALL STREET
4   1090 AVENUE NAME   5   1090 AVENUE NAME

So I think a doc-update, maybe with some more examples would help?

edublancas · 2016-01-31T16:44:45Z

Thanks for the clarification! I agree that a documentation update will be really helpful.

Besides an update in the join method doc, a brief mention of this behavior would be useful at the bottom of Database-style DataFrame joining/merging section here.

jreback · 2016-01-31T16:53:06Z

sure - pull requests would be welcome!

jreback added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 31, 2016

jreback added this to the Next Major Release milestone Jan 31, 2016

jreback added Difficulty Novice labels Jan 31, 2016

edublancas mentioned this issue Feb 1, 2016

Improved documentation for DataFrame.join #12193

Closed

jreback modified the milestones: 0.18.0, Next Major Release Feb 1, 2016

jreback modified the milestones: 0.18.1, 0.18.0 Feb 11, 2016

jreback modified the milestones: 0.18.2, 0.18.1 Apr 18, 2016

jreback closed this as completed in 57ea76f May 26, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Confusing interpretation of what DataFrame.join does #12188

Confusing interpretation of what DataFrame.join does #12188

edublancas commented Jan 31, 2016

jreback commented Jan 31, 2016

jreback commented Jan 31, 2016

edublancas commented Jan 31, 2016

jreback commented Jan 31, 2016

Confusing interpretation of what DataFrame.join does #12188

Confusing interpretation of what DataFrame.join does #12188

Comments

edublancas commented Jan 31, 2016

jreback commented Jan 31, 2016

jreback commented Jan 31, 2016

edublancas commented Jan 31, 2016

jreback commented Jan 31, 2016