Skip to content

merge() outer with left_on column and right_index=True produces unexpected results #33232

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
cwkwong opened this issue Apr 2, 2020 · 3 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@cwkwong
Copy link
Contributor

cwkwong commented Apr 2, 2020

Apologies if this has been reported before, I searched through the issues but couldn't find a duplicate, IMHO.

Code Sample:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'lkey': [0, np.nan, np.nan, 3], 'value': [1, 2, 3, 5]})
>>> df1
   lkey  value
0   0.0      1
1   NaN      2
2   NaN      3
3   3.0      5


df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'rock', 'baz'], 'value': [0, 1, 2, 3]})
>>> df2
   rkey  value
0   foo      0
1   bar      1
2  rock      2
3   baz      3


res_1 = pd.merge(df1, df2, how='outer', suffixes=('_left', '_right'), left_on='lkey', right_index=True)
>>> res_1
     lkey  value_left  rkey  value_right
0.0   0.0         1.0   foo          0.0
1.0   NaN         2.0   NaN          NaN
2.0   NaN         3.0   NaN          NaN
3.0   3.0         5.0   baz          3.0
NaN   1.0         NaN   bar          1.0
NaN   2.0         NaN  rock          2.0


res_2 = pd.merge(df1, df2, how='outer', suffixes=('_left', '_right'), left_on='lkey', right_on='value')
>>> res_2
   lkey  value_left  rkey  value_right
0   0.0         1.0   foo          0.0
1   NaN         2.0   NaN          NaN
2   NaN         3.0   NaN          NaN
3   3.0         5.0   baz          3.0
4   NaN         NaN   bar          1.0
5   NaN         NaN  rock          2.0

Problem description

  • When merging with how=outer and right_index=True, result res_1 produces the last 2 rows with some unexpected values for the index and especially the lkey.
    • Last 2 rows have lkey values that don't seem to exist in the original left side df1.
    • Last 2 rows have np.nan for index. Was expecting perhaps [4.0, 5.0]
  • Compare this to res_2. Which is almost identical merge except now instead of right_index=True we use a column right_on='value'
    • the df2 index and value column have the same type and values.
  • I would expect seeing res_2 instead of res_1 when merging with right_index=True above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0
Cython : 0.29.11
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.9.6
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : 2.6.9
odfpy : None
openpyxl : 1.8.6
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.3.12
tables : 3.5.1
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 0.9.6

@simonjayhawkins
Copy link
Member

@cwkwong Thanks for the report. The code sample includes missing values in the lkey column which is used for the merge key. just to be clear, the issue also occurs without missing values in the merge column.

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1068.g49bc8d8c9'
>>>
>>> df1 = pd.DataFrame({"lkey": [0, 3, 5, 6], "value": [1, 2, 3, 5]})
>>> df1
   lkey  value
0     0      1
1     3      2
2     5      3
3     6      5
>>>
>>> df2 = pd.DataFrame({"rkey": ["foo", "bar", "rock", "baz"], "value": [0, 1, 2, 3]})
>>> df2
   rkey  value
0   foo      0
1   bar      1
2  rock      2
3   baz      3
>>>
>>> pd.merge(
...     df1,
...     df2,
...     how="outer",
...     suffixes=("_left", "_right"),
...     left_on="lkey",
...     right_index=True,
... )
     lkey  value_left  rkey  value_right
0.0     0         1.0   foo          0.0
1.0     3         2.0   baz          3.0
2.0     5         3.0   NaN          NaN
3.0     6         5.0   NaN          NaN
NaN     1         NaN   bar          1.0
NaN     2         NaN  rock          2.0
>>>

@simonjayhawkins simonjayhawkins added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 2, 2020
@simonjayhawkins
Copy link
Member

This looks similar to #17257 but with subtle differences so will keep this open for now.

@phofl
Copy link
Member

phofl commented May 28, 2020

@simonjayhawkins The desired behavior ist not completly clear to me. The documentation says, that indexes are passed along, if join is done on index. That is the case here.

Based on this I would either expecet a MultiIndex in the result which has the index of the left DataFrame as first component and the index of the right DataFrame as second component or the result we are seeing here.

The lkey values are wrong in the last two rows, no doubt about that. But this error has another origin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

3 participants