Bug: Joining MultiIndex with NaNs #29252

hanudev · 2019-10-28T13:39:12Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'col1': [1.1, 1.2]},
                   index=pd.MultiIndex.from_product([['A'], [1.0, 2.0]],
                                                    names=['id1', 'id2']))
df2 = pd.DataFrame(data={'col2': [2.1, 2.2]},
                   index=pd.MultiIndex.from_product([['A'], [np.NaN, 2.0]],
                                                    names=['id1', 'id2']))
print(df1.join(df2))

Problem description

The index containing the NaN value of df2 is falsely joined with the index of df1. Therefore, the result contains both values of df2 instead of only one value and a NaN value.

Expected Output

         col1  col2
id1 id2
A   1.0   1.1   NaN
    2.0   1.2   2.2

The expected result was returned by older pandas versions (successfully tested with 0.18.1 ; 0.20.3 ; 0.21.1 ; 0.22.0).

Actual Output

         col1  col2
id1 id2
A   1.0   1.1   2.1
    2.0   1.2   2.2

This result is returned by newer pandas versions, beginning with version 0.23.0 (tested with 0.23.0 ; 0.23.4 ; 0.24.2 ; 0.25.2).

I suppose this is a bug, or am I missing something?

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.2
numpy : 1.16.2
pytz : 2018.9
dateutil : 2.8.0
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

The text was updated successfully, but these errors were encountered:

simonjayhawkins · 2019-10-28T14:40:04Z

master produces different output

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'0.26.0.dev0+684.g953757a3e'
>>>
>>> df1 = pd.DataFrame(
...     data={"col1": [1.1, 1.2]},
...     index=pd.MultiIndex.from_product([["A"], [1.0, 2.0]], names=["id1", "id2"]),
... )
>>> df2 = pd.DataFrame(
...     data={"col2": [2.1, 2.2]},
...     index=pd.MultiIndex.from_product([["A"], [np.NaN, 2.0]], names=["id1", "id2"]),
... )
>>> print(df1.join(df2))
         col1  col2
id1 id2
A   1.0   1.1   NaN
    2.0   1.2   NaN
>>>

hanudev · 2020-05-12T10:31:18Z

In case someone else has the same problem, the following gives the expected output:

print(df1.join(df2, how='outer').loc[df1.index])

This is of course only a workaround, which is noticeably slower than a direct join.

ezwelty · 2022-08-24T11:55:13Z

pandas 1.4.3

I just ran into this problem, and I would argue that it is more serious than first meets the eye. The behavior is inconsistent between pd.Index (correct) and pd.MultiIndex (wrong) and for pd.MultiIndex, between pd.DataFrame.join() (wrong) and pd.DataFrame.merge() (correct).

Say we want to join the following dataframes, and one has a missing value in the x column. (I use float64 here, but the behavior is the same with Float64, Int64, ... and pd.NA).

import pandas as pd

dfa = pd.DataFrame({
  'x': [1.0, float('nan')],
  'y': [1.0, 2.0]
})

dfb = pd.DataFrame({
  'x': [1.0, 2.0],
  'y': [1.0, 2.0],
  'z': [10.0, 20.0]
})

Joining on just x (with pd.Index) works as expected.

key = ['x']
ia = pd.Index(dfa['x'])
ib = pd.Index(dfb['x'])
ia.difference(ib).tolist()
# [nan]
ia.intersection(ib).tolist()
# [1.0]
dfa.set_index(key).join(dfb.set_index(key), how='inner', rsuffix='b')
#        y   yb     z
# x                  
# 1.0  1.0  1.0  10.0
dfa.merge(dfb, left_on=key, right_on=key, suffixes=('', 'b'))
#      x    y   yb     z
# 0  1.0  1.0  1.0  10.0

But joining on both x and y (with pd.MultiIndex) fails for pd.DataFrame.join somehow works correctly for pd.DataFrame.merge:

key = ['x', 'y']
ia = pd.MultiIndex.from_frame(dfa[key])
ib = pd.MultiIndex.from_frame(dfb[key])
ia.difference(ib).tolist()
# []
ia.intersection(ib)
# [(1.0, 1.0), (nan, 2.0)]
dfa.set_index(key).join(dfb.set_index(key), how='inner')
#             z
# x   y        
# 1.0 1.0  10.0
# NaN 2.0   NaN
dfa.merge(dfb, how='inner')
#      x    y     z
# 0  1.0  1.0  10.0

simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 28, 2019

simonjayhawkins added the MultiIndex label Apr 4, 2020

mroeschke added the Bug label May 11, 2020

gertjac mentioned this issue May 22, 2020

BUG: wrong join with nan in multiindex #34035

Closed

3 tasks

theemathas mentioned this issue Oct 6, 2020

BUG: Left join is broken with a MultiIndex with only one level #36909

Closed

3 tasks

lukemanley mentioned this issue Sep 30, 2022

TST: MultiIndex.get_indexer with na/missing #48877

Merged

10 tasks

phofl closed this as completed in #48877 Nov 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Joining MultiIndex with NaNs #29252

Bug: Joining MultiIndex with NaNs #29252

hanudev commented Oct 28, 2019

INSTALLED VERSIONS

simonjayhawkins commented Oct 28, 2019

hanudev commented May 12, 2020

ezwelty commented Aug 24, 2022

Bug: Joining MultiIndex with NaNs #29252

Bug: Joining MultiIndex with NaNs #29252

Comments

hanudev commented Oct 28, 2019

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Actual Output

Output of pd.show_versions()

INSTALLED VERSIONS

simonjayhawkins commented Oct 28, 2019

hanudev commented May 12, 2020

ezwelty commented Aug 24, 2022

Output of `pd.show_versions()`