Skip to content

BUG: Merge fails on nullable integer columns when dtypes are unequal #46799

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
NickCrews opened this issue Apr 18, 2022 · 6 comments
Closed
3 tasks done

BUG: Merge fails on nullable integer columns when dtypes are unequal #46799

NickCrews opened this issue Apr 18, 2022 · 6 comments
Assignees
Labels
Bug Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Apr 18, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

null_keys = pd.DataFrame(
    {
        "key": pd.Series([0, 1, pd.NA], dtype=pd.Int64Dtype()),
        "val1": pd.Series(["a", "b", "c"]),
    }
)
non_null_keys = pd.DataFrame(
    {
        "key": pd.Series([0, 1, 2], dtype=np.int64),  # This fails!
        # "key": pd.Series([0, 1, 2], dtype=pd.UInt64Dtype()),  # This also fails!
        # "key": pd.Series([0, 1, 2], dtype=pd.Int64Dtype()),  # This works!
        "val2": pd.Series(["d", "e", "f"]),
    }
)
print(pd.merge(null_keys, non_null_keys, how="left", on="key").to_markdown())

Issue Description

If you try to merge two (or more, probably) objects, and all of the following are true

  • col1 is a nullable integer type (eg pd.Int64Dtype())
  • col1 contains nulls
  • col2 is an integer type (either nullable eg pd.Int64Dtype() or non-nullable eg np.int64)
  • col1.dtype != col2.dtype
    then the merge fails with:
Traceback (most recent call last):
  File "test.py", line 20, in <module>
    pd.merge(null_keys, non_null_keys, on="key")
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 123, in merge
    return op.get_result()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 717, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 968, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 942, in _get_join_indexers
    return get_join_indexers(
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 1497, in get_join_indexers
    zipped = zip(*mapped)
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 1494, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 2227, in _factorize_keys
    lk = ensure_int64(np.asarray(lk))
  File "pandas/_libs/algos_common_helper.pxi", line 81, in pandas._libs.algos.ensure_int64
    return arr.astype(np.int64, copy=copy)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NAType'

However, if I change the np.int64 dtype to pd.Int64Dtype() as well (eg both types are equal), then the merge works fine. (Note that changing both types to non-nullable types isn't applicable, because some of the keys are indeed null so that would be impossible). Also fails if one is Int64 and the other is UInt64. Tested on main.

The offending code is probably in pandas/core/reshape/merge.py, line 2227, in _factorize_keys, but I didn't dig any deeper than that.

Expected Behavior

The merge should act the same as when both keys are nullable and dtypes are equal (e.g. both Int64Dtype):

key val1 val2
0 0 a d
1 1 b e
2 nan c nan

Installed Versions

INSTALLED VERSIONS

commit : ec75436
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.47-linuxkit
Version : #1 SMP Sat Jul 3 21:51:47 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.5.0.dev0+688.gec7543622a
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 60.9.3
Cython : 0.29.28
pytest : 7.0.1
hypothesis : 6.39.1
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.4
brotli :
fastparquet : 0.8.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
markupsafe : 2.1.0
matplotlib : 3.5.1
numba : 0.53.1
numexpr : 2.8.0
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : 1.1.4
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.31
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

@NickCrews NickCrews added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 18, 2022
@NickCrews NickCrews changed the title BUG: Merge fails on mix of nullable and non-nullable integer columns BUG: Merge fails on nullable integer columns when dtypes are unequal Apr 18, 2022
@pranavc28
Copy link

take

@NickCrews
Copy link
Contributor Author

Hey @pranavc28 , thanks for taking this! Just FYI, you probably want to get confirmation from a maintainer that what I wrote as the "expected behavior" is actually the goal. You don't want to do the work to write a PR and then learn later that you didn't actually solve the problem the right way :)

@phofl
Copy link
Member

phofl commented Apr 22, 2022

@NickCrews this looks like a duplicate of #46178?

@NickCrews
Copy link
Contributor Author

This is a duplicate of #46178

@phofl
Copy link
Member

phofl commented Apr 22, 2022

Thx

@NickCrews
Copy link
Contributor Author

@phofl I don't have permissions to add the duplicate label, IDK if you want to. Thanks!

@phofl phofl added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Duplicate Report Duplicate issue or pull request and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

3 participants