Skip to content

Bug: Joining MultiIndex with NaNs #29252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hanudev opened this issue Oct 28, 2019 · 3 comments · Fixed by #48877
Closed

Bug: Joining MultiIndex with NaNs #29252

hanudev opened this issue Oct 28, 2019 · 3 comments · Fixed by #48877
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@hanudev
Copy link

hanudev commented Oct 28, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'col1': [1.1, 1.2]},
                   index=pd.MultiIndex.from_product([['A'], [1.0, 2.0]],
                                                    names=['id1', 'id2']))
df2 = pd.DataFrame(data={'col2': [2.1, 2.2]},
                   index=pd.MultiIndex.from_product([['A'], [np.NaN, 2.0]],
                                                    names=['id1', 'id2']))
print(df1.join(df2))

Problem description

The index containing the NaN value of df2 is falsely joined with the index of df1. Therefore, the result contains both values of df2 instead of only one value and a NaN value.

Expected Output

         col1  col2
id1 id2
A   1.0   1.1   NaN
    2.0   1.2   2.2

The expected result was returned by older pandas versions (successfully tested with 0.18.1 ; 0.20.3 ; 0.21.1 ; 0.22.0).

Actual Output

         col1  col2
id1 id2
A   1.0   1.1   2.1
    2.0   1.2   2.2

This result is returned by newer pandas versions, beginning with version 0.23.0 (tested with 0.23.0 ; 0.23.4 ; 0.24.2 ; 0.25.2).

I suppose this is a bug, or am I missing something?

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.2
numpy : 1.16.2
pytz : 2018.9
dateutil : 2.8.0
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@simonjayhawkins
Copy link
Member

master produces different output

>>> import numpy as np
>>> import pandas as pd
>>>
>>> pd.__version__
'0.26.0.dev0+684.g953757a3e'
>>>
>>> df1 = pd.DataFrame(
...     data={"col1": [1.1, 1.2]},
...     index=pd.MultiIndex.from_product([["A"], [1.0, 2.0]], names=["id1", "id2"]),
... )
>>> df2 = pd.DataFrame(
...     data={"col2": [2.1, 2.2]},
...     index=pd.MultiIndex.from_product([["A"], [np.NaN, 2.0]], names=["id1", "id2"]),
... )
>>> print(df1.join(df2))
         col1  col2
id1 id2
A   1.0   1.1   NaN
    2.0   1.2   NaN
>>>

@simonjayhawkins simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 28, 2019
@mroeschke mroeschke added the Bug label May 11, 2020
@hanudev
Copy link
Author

hanudev commented May 12, 2020

In case someone else has the same problem, the following gives the expected output:

print(df1.join(df2, how='outer').loc[df1.index])

This is of course only a workaround, which is noticeably slower than a direct join.

@ezwelty
Copy link

ezwelty commented Aug 24, 2022

pandas 1.4.3

I just ran into this problem, and I would argue that it is more serious than first meets the eye. The behavior is inconsistent between pd.Index (correct) and pd.MultiIndex (wrong) and for pd.MultiIndex, between pd.DataFrame.join() (wrong) and pd.DataFrame.merge() (correct).

Say we want to join the following dataframes, and one has a missing value in the x column. (I use float64 here, but the behavior is the same with Float64, Int64, ... and pd.NA).

import pandas as pd

dfa = pd.DataFrame({
  'x': [1.0, float('nan')],
  'y': [1.0, 2.0]
})

dfb = pd.DataFrame({
  'x': [1.0, 2.0],
  'y': [1.0, 2.0],
  'z': [10.0, 20.0]
})

Joining on just x (with pd.Index) works as expected.

key = ['x']
ia = pd.Index(dfa['x'])
ib = pd.Index(dfb['x'])
ia.difference(ib).tolist()
# [nan]
ia.intersection(ib).tolist()
# [1.0]
dfa.set_index(key).join(dfb.set_index(key), how='inner', rsuffix='b')
#        y   yb     z
# x                  
# 1.0  1.0  1.0  10.0
dfa.merge(dfb, left_on=key, right_on=key, suffixes=('', 'b'))
#      x    y   yb     z
# 0  1.0  1.0  1.0  10.0

But joining on both x and y (with pd.MultiIndex) fails for pd.DataFrame.join somehow works correctly for pd.DataFrame.merge:

key = ['x', 'y']
ia = pd.MultiIndex.from_frame(dfa[key])
ib = pd.MultiIndex.from_frame(dfb[key])
ia.difference(ib).tolist()
# []
ia.intersection(ib)
# [(1.0, 1.0), (nan, 2.0)]
dfa.set_index(key).join(dfb.set_index(key), how='inner')
#             z
# x   y        
# 1.0 1.0  10.0
# NaN 2.0   NaN
dfa.merge(dfb, how='inner')
#      x    y     z
# 0  1.0  1.0  10.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants