
BUG: Merge error when merge col is int64 and Int64 #46178


Closed
2 of 3 tasks
tritemio opened this issue Feb 28, 2022 · 9 comments · Fixed by #54755
Labels
Bug; NA - MaskedArrays (related to pd.NA and nullable extension arrays); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@tritemio

tritemio commented Feb 28, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# First DF
df1 = pd.DataFrame({'col1': [1,1,2,2,3,pd.NA]})
df1['col1'] = df1['col1'].astype('Int64')

# Second DF
df2 = pd.DataFrame({'col1': [1,2,3], 'col2': list('abc')}).set_index('col1')

print(df1.dtypes)
print(df2.dtypes)
print(df1.merge(df2, left_on='col1', right_index=True))

Issue Description

Merging two DataFrames on a column (col1) that is Int64 in the first DataFrame and int64 in the second raises the following confusing error:

Traceback (most recent call last):
  File "/Users/anto/xxx/src/bugs/pandas_join.py", line 30, in <module>
    print(df1.merge(df2, on='col1', how='left'))
  File "/Users/anto/miniconda3/conda-envs/py310/lib/python3.10/site-packages/pandas/core/frame.py", line 9339, in merge
    return merge(
  File "/Users/anto/miniconda3/conda-envs/py310/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 122, in merge
    return op.get_result()
  File "/Users/anto/miniconda3/conda-envs/py310/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 716, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/Users/anto/miniconda3/conda-envs/py310/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 967, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/Users/anto/miniconda3/conda-envs/py310/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 941, in _get_join_indexers
    return get_join_indexers(
  File "/Users/anto/miniconda3/conda-envs/py310/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1484, in get_join_indexers
    zipped = zip(*mapped)
  File "/Users/anto/miniconda3/conda-envs/py310/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1481, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
  File "/Users/anto/miniconda3/conda-envs/py310/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 2164, in _factorize_keys
    lk = ensure_int64(np.asarray(lk))
  File "pandas/_libs/algos_common_helper.pxi", line 81, in pandas._libs.algos.ensure_int64
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NAType'
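
The last two frames hint at why the message mentions NAType: the key is passed through np.asarray and then coerced to int64. Here is a minimal sketch of that path outside of merge. It assumes that np.asarray on an Int64 array holding pd.NA yields an object array, and it uses plain NumPy casting rather than the internal ensure_int64, so treat it as an illustration of the failure mode, not as the pandas code itself:

```python
import numpy as np
import pandas as pd

# Nullable integer key with one missing value, as in df1['col1'].
keys = pd.array([1, 2, pd.NA], dtype="Int64")

# Converting to NumPy without a target dtype keeps pd.NA as a Python object.
as_numpy = np.asarray(keys)
print(as_numpy.dtype)

# Coercing the object array to int64 calls int() on each element,
# which fails on pd.NA with a TypeError like the one in the traceback.
try:
    as_numpy.astype(np.int64)
    coercion_error = None
except TypeError as exc:
    coercion_error = exc

print(type(coercion_error).__name__, coercion_error)
```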

Expected Behavior

If the types of the "merge columns" are compatible, the merge should complete with no errors. Alternatively, the error message should warn the user about the type mismatch and the need to cast one of the columns.

For example, casting col1 in df2 from int64 to Int64 produces the correct merge:

import pandas as pd

# First DF
df1 = pd.DataFrame({'col1': [1,1,2,2,3,pd.NA]})
df1['col1'] = df1['col1'].astype('Int64')

# Second DF
df2 = pd.DataFrame({'col1': [1,2,3], 'col2': list('abc')}).set_index('col1')

# These two lines are required for the merge to work
df2 = df2.reset_index()
df2['col1'] = df2['col1'].astype('Int64')


print(df1.dtypes)
print(df2.dtypes)
print(df1.merge(df2, on='col1', how='left'))

However, this workaround looks a bit too convoluted to me.
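
If the cast has to be repeated in several places, it can be wrapped once. The sketch below is only a convenience around the same two steps; merge_with_aligned_key is a hypothetical helper name, not a pandas API, and it assumes the key is either a column or the (named) index of the right frame:

```python
import pandas as pd


def merge_with_aligned_key(left, right, key, **merge_kwargs):
    """Hypothetical helper: cast right[key] to the dtype of left[key], then merge."""
    if key not in right.columns:
        # The key lives in the index (as in df2 above); move it to a column.
        right = right.reset_index()
    # astype returns a new frame, so the caller's DataFrame is not modified.
    right = right.astype({key: left[key].dtype})
    return left.merge(right, on=key, **merge_kwargs)


df1 = pd.DataFrame({"col1": pd.array([1, 1, 2, 2, 3, pd.NA], dtype="Int64")})
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": list("abc")}).set_index("col1")

result = merge_with_aligned_key(df1, df2, "col1", how="left")
print(result)
```

The row whose key is pd.NA has no match in df2, so its col2 value is missing in the result.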

Installed Versions

1.4.1

@tritemio added the Bug and Needs Triage labels on Feb 28, 2022
@phofl added the NA - MaskedArrays and Reshaping labels and removed the Needs Triage label on Mar 10, 2022
@NickCrews
Contributor

I reported a duplicate of this bug in #46799, which has a couple of additional notes. I confirmed that it is present on main. A slightly more precise description of the conditions that trigger this bug:

  • col1 is a nullable integer type (e.g. pd.Int64Dtype())
  • col1 contains nulls
  • col2 is an integer type (either nullable, e.g. pd.Int64Dtype(), or non-nullable, e.g. np.int64)
  • col1.dtype != col2.dtype
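
The four conditions above can be turned into a quick pre-merge check. This is a sketch under the assumption that "nullable integer type" means an integer dtype backed by an extension array; matches_reported_conditions is a hypothetical name introduced for illustration:

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_integer_dtype


def matches_reported_conditions(col1: pd.Series, col2: pd.Series) -> bool:
    """Return True if the two merge keys match the four conditions listed above."""
    # Condition 1: col1 is a nullable (extension-backed) integer type.
    col1_is_nullable_int = is_integer_dtype(col1.dtype) and is_extension_array_dtype(col1.dtype)
    # Condition 2: col1 actually contains nulls.
    col1_has_nulls = bool(col1.isna().any())
    # Condition 3: col2 is any integer type, nullable or not.
    col2_is_int = is_integer_dtype(col2.dtype)
    # Condition 4: the two dtypes differ.
    dtypes_differ = col1.dtype != col2.dtype
    return col1_is_nullable_int and col1_has_nulls and col2_is_int and dtypes_differ


nullable_with_na = pd.Series([1, pd.NA], dtype="Int64")
nullable_no_na = pd.Series([1, 2], dtype="Int64")
plain_int = pd.Series([1, 2], dtype="int64")

print(matches_reported_conditions(nullable_with_na, plain_int))
print(matches_reported_conditions(nullable_no_na, plain_int))
```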

@phofl
Member

phofl commented Dec 17, 2022

This was solved recently through mask support in merge operations

@phofl phofl closed this as completed Dec 17, 2022
@Flowhill

Flowhill commented Jun 7, 2023

> This was solved recently through mask support in merge operations

Can you tell me how this was solved? I still get the same error when merging two DataFrames on a column that is read as Int64 and contains missing values: the int() check fails on the NAType in the column.

@NickCrews
Contributor

NickCrews commented Jun 7, 2023

Bummer! Thanks for the report. What is pandas.__version__?

If both columns are Int64, then this might be a different (but related) bug (see my condition above where dtypes must be different).

Give us a reproducible example if you want us to investigate.
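
For anyone unsure what to include, a report along these lines covers the version and the key dtypes asked about above. The frames and the column names key and value are placeholders I made up for illustration; substitute the real merge keys:

```python
import pandas as pd

# Placeholder frames; replace them with the DataFrames that fail to merge.
left = pd.DataFrame({"key": pd.array([1, pd.NA], dtype="Int64")})
right = pd.DataFrame({"key": [1, 2], "value": ["a", "b"]})

# Collect the facts that matter for this bug: version, key dtypes, nulls.
report = {
    "pandas": pd.__version__,
    "left key dtype": str(left["key"].dtype),
    "right key dtype": str(right["key"].dtype),
    "left key has nulls": bool(left["key"].isna().any()),
}
for name, value in report.items():
    print(f"{name}: {value}")
```

pd.show_versions() prints the fuller environment details that the issue template asks for under "Installed Versions".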

@GRcharles

GRcharles commented Jun 23, 2023

Hey! I'm running into this too.
In my case, I get the error when merging a float field with an Int64Dtype field:

>>> import pandas as pd
>>> pd.__version__
'2.0.2'
>>> pd.merge(pd.DataFrame([1.0]), pd.DataFrame([1.0]).astype(pd.Int64Dtype()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "REDACTED\venv\lib\site-packages\pandas\core\reshape\merge.py", line 148, in merge
    op = _MergeOperation(
  File "REDACTED\venv\lib\site-packages\pandas\core\reshape\merge.py", line 741, in __init__
    self._maybe_coerce_merge_keys()
  File "REDACTED\venv\lib\site-packages\pandas\core\reshape\merge.py", line 1333, in _maybe_coerce_merge_keys
    casted = lk.astype(rk.dtype)  # type: ignore[arg-type]
TypeError: Cannot interpret 'Int64Dtype()' as a data type
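
On affected versions, a possible stopgap for this float/Int64 case is to align the key dtypes before merging, in the same spirit as the cast in the original report. The sketch below uses a named column key (an assumption; the one-liner above uses the default column 0) and assumes the float values are whole numbers, since casting non-integral floats to Int64 fails:

```python
import pandas as pd

# Float key on the left, nullable Int64 key on the right.
left = pd.DataFrame({"key": [1.0, 2.0]})
right = pd.DataFrame({"key": [1.0, 3.0]}).astype(pd.Int64Dtype())

# Cast the float key to Int64 so both sides share one dtype before merging.
left_aligned = left.astype({"key": "Int64"})
merged = left_aligned.merge(right, on="key")
print(merged)
```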

@phofl
Member

phofl commented Jun 23, 2023

Can you provide a reproducer?

@Flowhill

Flowhill commented Jun 23, 2023 via email

@zerodarkzone

@GRcharles, were you able to solve it?
I'm having the same problem.

@phofl
Member

phofl commented Aug 27, 2023

Sorry folks, this is fixed now and will be in 2.1
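
Code that must run on both older and newer installs could branch on the version. This sketch relies only on the statement above that the fix ships in 2.1; major_minor and merge_fix_expected are hypothetical helper names, and the parsing assumes a version string that starts with "major.minor":

```python
import pandas as pd


def major_minor(version: str) -> tuple:
    """Parse the leading 'major.minor' of a version string into a tuple of ints."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def merge_fix_expected(version: str = pd.__version__) -> bool:
    """True if the version is 2.1 or later, where the fix is said to be included."""
    return major_minor(version) >= (2, 1)


print(pd.__version__, merge_fix_expected())
```

On versions where this returns False, casting the merge keys to a common dtype first remains the safer route.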

6 participants