merge() outer with left_on column and right_index=True produces unexpected results #33232

cwkwong · 2020-04-02T03:54:22Z

Apologies if this has been reported before, I searched through the issues but couldn't find a duplicate, IMHO.

Code Sample:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'lkey': [0, np.nan, np.nan, 3], 'value': [1, 2, 3, 5]})
>>> df1
   lkey  value
0   0.0      1
1   NaN      2
2   NaN      3
3   3.0      5


df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'rock', 'baz'], 'value': [0, 1, 2, 3]})
>>> df2
   rkey  value
0   foo      0
1   bar      1
2  rock      2
3   baz      3


res_1 = pd.merge(df1, df2, how='outer', suffixes=('_left', '_right'), left_on='lkey', right_index=True)
>>> res_1
     lkey  value_left  rkey  value_right
0.0   0.0         1.0   foo          0.0
1.0   NaN         2.0   NaN          NaN
2.0   NaN         3.0   NaN          NaN
3.0   3.0         5.0   baz          3.0
NaN   1.0         NaN   bar          1.0
NaN   2.0         NaN  rock          2.0


res_2 = pd.merge(df1, df2, how='outer', suffixes=('_left', '_right'), left_on='lkey', right_on='value')
>>> res_2
   lkey  value_left  rkey  value_right
0   0.0         1.0   foo          0.0
1   NaN         2.0   NaN          NaN
2   NaN         3.0   NaN          NaN
3   3.0         5.0   baz          3.0
4   NaN         NaN   bar          1.0
5   NaN         NaN  rock          2.0

Problem description

When merging with how=outer and right_index=True, result res_1 produces the last 2 rows with some unexpected values for the index and especially the lkey.
- Last 2 rows have lkey values that don't seem to exist in the original left side df1.
- Last 2 rows have np.nan for index. Was expecting perhaps [4.0, 5.0]
Compare this to res_2. Which is almost identical merge except now instead of right_index=True we use a column right_on='value'
- the df2 index and value column have the same type and values.
I would expect seeing res_2 instead of res_1 when merging with right_index=True above.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0
Cython : 0.29.11
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.9.6
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : 2.6.9
odfpy : None
openpyxl : 1.8.6
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.3.12
tables : 3.5.1
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 0.9.6

The text was updated successfully, but these errors were encountered:

simonjayhawkins · 2020-04-02T14:16:51Z

@cwkwong Thanks for the report. The code sample includes missing values in the lkey column which is used for the merge key. just to be clear, the issue also occurs without missing values in the merge column.

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1068.g49bc8d8c9'
>>>
>>> df1 = pd.DataFrame({"lkey": [0, 3, 5, 6], "value": [1, 2, 3, 5]})
>>> df1
   lkey  value
0     0      1
1     3      2
2     5      3
3     6      5
>>>
>>> df2 = pd.DataFrame({"rkey": ["foo", "bar", "rock", "baz"], "value": [0, 1, 2, 3]})
>>> df2
   rkey  value
0   foo      0
1   bar      1
2  rock      2
3   baz      3
>>>
>>> pd.merge(
...     df1,
...     df2,
...     how="outer",
...     suffixes=("_left", "_right"),
...     left_on="lkey",
...     right_index=True,
... )
     lkey  value_left  rkey  value_right
0.0     0         1.0   foo          0.0
1.0     3         2.0   baz          3.0
2.0     5         3.0   NaN          NaN
3.0     6         5.0   NaN          NaN
NaN     1         NaN   bar          1.0
NaN     2         NaN  rock          2.0
>>>

simonjayhawkins · 2020-04-02T14:32:00Z

This looks similar to #17257 but with subtle differences so will keep this open for now.

phofl · 2020-05-28T20:36:30Z

@simonjayhawkins The desired behavior ist not completly clear to me. The documentation says, that indexes are passed along, if join is done on index. That is the case here.

Based on this I would either expecet a MultiIndex in the result which has the index of the left DataFrame as first component and the index of the right DataFrame as second component or the result we are seeing here.

The lkey values are wrong in the last two rows, no doubt about that. But this error has another origin.

simonjayhawkins added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 2, 2020

phofl mentioned this issue May 31, 2020

BUG: Fix value setting in case of merging via index on one side and column on other side. #34496

Closed

9 tasks

debnathshoham mentioned this issue Oct 28, 2022

BUG: join on column with index #49360

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merge() outer with left_on column and right_index=True produces unexpected results #33232

merge() outer with left_on column and right_index=True produces unexpected results #33232

cwkwong commented Apr 2, 2020

INSTALLED VERSIONS

simonjayhawkins commented Apr 2, 2020

simonjayhawkins commented Apr 2, 2020

phofl commented May 28, 2020

merge() outer with left_on column and right_index=True produces unexpected results #33232

merge() outer with left_on column and right_index=True produces unexpected results #33232

Comments

cwkwong commented Apr 2, 2020

Code Sample:

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

simonjayhawkins commented Apr 2, 2020

simonjayhawkins commented Apr 2, 2020

phofl commented May 28, 2020

Output of `pd.show_versions()`