BUG: Merge with EA and MultiIndex #43734

Closed
3 tasks done
benoit9126 opened this issue Sep 24, 2021 · 2 comments · Fixed by #43785
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@benoit9126
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df1 = pd.DataFrame(data={("lvl0","lvl1-a"): ["1","2","3"], ("lvl0","lvl1-b"): ["4","5","6"]}, dtype=str)

df2 = pd.DataFrame(data={("lvl0","lvl1-a"): ["1","2","3"], ("lvl0","lvl1-c"): ["7","8","9"]}, dtype=pd.StringDtype())

result = pd.merge(left=df1, right=df2, on=[("lvl0","lvl1-a")])

Issue Description

I am trying to merge two data frames whose columns are a MultiIndex. The first data frame has an object dtype (str in my example) and the second has an extension array dtype (pd.StringDtype() in my example). In this case, an internal method calls assign, which fails because the column name is a tuple rather than a string.
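To illustrate the kind of failure described, here is a minimal sketch (not the actual pandas internals) of why a keyword-based call such as assign cannot take a tuple column name. Python itself rejects non-string keyword names, while item assignment accepts a tuple key. The variable names `key`, `error`, and `out` are only for illustration.

```python
import pandas as pd

# A frame whose columns form a MultiIndex, as in the report.
key = ("lvl0", "lvl1-a")
df = pd.DataFrame({key: ["1", "2", "3"]})

# assign() takes column names as keyword arguments, and Python requires
# keyword names to be strings, so a tuple name raises TypeError.
error = None
try:
    df.assign(**{key: df[key]})
except TypeError as exc:
    error = exc

# Item assignment has no such restriction and accepts the tuple key.
out = df.copy()
out[key] = out[key].astype("string")
```

Whether the fix referenced in this issue takes this route is not stated here. The sketch only shows the difference between the two ways of setting a column.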

Expected Behavior

I would have expected the merge to succeed, as it does with two str dtype data frames or two pd.StringDtype() dtype data frames.

    lvl0              
  lvl1-a lvl1-b lvl1-c
0      1      4      7
1      2      5      8
2      3      6      9

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.9.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-8-amd64
Version : #1 SMP Debian 5.10.46-4 (2021-08-03)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8
pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 57.4.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.25
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@benoit9126 benoit9126 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 24, 2021
@debnathshoham debnathshoham added MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 24, 2021
@debnathshoham
Member

Thanks @benoit9126 for the report!
What would be the expected dtype of ('lvl0', 'lvl1-a')?

@benoit9126
Contributor Author

benoit9126 commented Sep 24, 2021

If, for instance, you take data frames with integers (the first with int and the other with pd.Int64Dtype()), the previous example works without trouble and the key column of the resulting data frame has dtype numpy.dtype[int64]:

import pandas as pd
df1 = pd.DataFrame(data={("lvl0","lvl1-a"): [1,2,3], ("lvl0","lvl1-b"): [4,5,6]}, dtype=int)
df2 = pd.DataFrame(data={("lvl0","lvl1-a"): [1,2,3], ("lvl0","lvl1-c"): [7,8,9]}, 
                   dtype=pd.Int64Dtype())
result = pd.merge(left=df1, right=df2, on=[("lvl0","lvl1-a")])
result.dtypes
# lvl0  lvl1-a    int64
#       lvl1-b    int64
#       lvl1-c    Int64
# dtype: object

I think this is the safest option, as the resulting columns can contain None, pd.NA, or np.nan values that are not in the input data frame with object dtype. Here, an object dtype for the resulting key column may be sufficient.
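As a small sketch of that point, an outer merge can introduce missing values that neither input contained. The example uses flat, hypothetical column names (`k`, `b`, `c`) so that it does not depend on the MultiIndex behavior reported above.

```python
import pandas as pd

# Left frame uses object dtype, right frame uses the string extension dtype.
left = pd.DataFrame({"k": ["1", "2"], "b": ["4", "5"]}, dtype=object)
right = pd.DataFrame({"k": ["2", "3"], "c": ["8", "9"]}, dtype="string")

# Keys "1" and "3" each exist on one side only, so the outer merge has to
# fill the other side's value column with a missing value.
merged = pd.merge(left, right, on="k", how="outer")

missing_b = int(merged["b"].isna().sum())
missing_c = int(merged["c"].isna().sum())
```

Neither input holds a missing value, yet `b` and `c` each gain one in the result.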

@mroeschke mroeschke added ExtensionArray Extending pandas with custom dtypes or arrays. and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 1, 2021
@jreback jreback added this to the 1.4 milestone Oct 11, 2021
4 participants