You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When merging with how=outer and right_index=True, result res_1 produces the last 2 rows with some unexpected values for the index and especially the lkey.
Last 2 rows have lkey values that don't seem to exist in the original left side df1.
Last 2 rows have np.nan for index. Was expecting perhaps [4.0, 5.0]
Compare this to res_2. Which is almost identical merge except now instead of right_index=True we use a column right_on='value'
the df2index and value column have the same type and values.
I would expect seeing res_2 instead of res_1 when merging with right_index=True above.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
@cwkwong Thanks for the report. The code sample includes missing values in the lkey column which is used for the merge key. just to be clear, the issue also occurs without missing values in the merge column.
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1068.g49bc8d8c9'
>>>
>>> df1 = pd.DataFrame({"lkey": [0, 3, 5, 6], "value": [1, 2, 3, 5]})
>>> df1
lkey value
0 0 1
1 3 2
2 5 3
3 6 5
>>>
>>> df2 = pd.DataFrame({"rkey": ["foo", "bar", "rock", "baz"], "value": [0, 1, 2, 3]})
>>> df2
rkey value
0 foo 0
1 bar 1
2 rock 2
3 baz 3
>>>
>>> pd.merge(
... df1,
... df2,
... how="outer",
... suffixes=("_left", "_right"),
... left_on="lkey",
... right_index=True,
... )
lkey value_left rkey value_right
0.0 0 1.0 foo 0.0
1.0 3 2.0 baz 3.0
2.0 5 3.0 NaN NaN
3.0 6 5.0 NaN NaN
NaN 1 NaN bar 1.0
NaN 2 NaN rock 2.0
>>>
@simonjayhawkins The desired behavior ist not completly clear to me. The documentation says, that indexes are passed along, if join is done on index. That is the case here.
Based on this I would either expecet a MultiIndex in the result which has the index of the left DataFrame as first component and the index of the right DataFrame as second component or the result we are seeing here.
The lkey values are wrong in the last two rows, no doubt about that. But this error has another origin.
Apologies if this has been reported before, I searched through the issues but couldn't find a duplicate, IMHO.
Code Sample:
Problem description
how=outer
andright_index=True
, resultres_1
produces the last 2 rows with some unexpected values for theindex
and especially thelkey
.lkey
values that don't seem to exist in the original left sidedf1
.np.nan
forindex
. Was expecting perhaps[4.0, 5.0]
res_2
. Which is almost identicalmerge
except now instead ofright_index=True
we use a columnright_on='value'
df2
index
andvalue
column have the same type and values.res_2
instead ofres_1
when merging withright_index=True
above.Output of
pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0
Cython : 0.29.11
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.9.6
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : 2.6.9
odfpy : None
openpyxl : 1.8.6
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : 1.3.12
tables : 3.5.1
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 0.9.6
The text was updated successfully, but these errors were encountered: