Skip to content

pandas 0.23 fails on column<->index boolean merge #21119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
crepererum opened this issue May 18, 2018 · 10 comments · Fixed by #21310
Closed

pandas 0.23 fails on column<->index boolean merge #21119

crepererum opened this issue May 18, 2018 · 10 comments · Fixed by #21310
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@crepererum
Copy link
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd

df1 = pd.DataFrame({'x': pd.Series([True, False], dtype=bool)})
df2 = df1.set_index('x')

df1.merge(df2, left_on='x', right_index=True)

Problem description

On Pandas 0.23, I get the following Exception:

venv/local/lib/python2.7/site-packages/pandas/core/frame.pyc in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
   6377                      right_on=right_on, left_index=left_index,
   6378                      right_index=right_index, sort=sort, suffixes=suffixes,
-> 6379                      copy=copy, indicator=indicator, validate=validate)
   6380
   6381     def round(self, decimals=0, *args, **kwargs):

venv/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     58                          right_index=right_index, sort=sort, suffixes=suffixes,
     59                          copy=copy, indicator=indicator,
---> 60                          validate=validate)
     61     return op.get_result()
     62

venv/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    552         # validate the merge keys dtypes. We may need to coerce
    553         # to avoid incompat dtypes
--> 554         self._maybe_coerce_merge_keys()
    555
    556         # If argument passed to validate,

venv/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in _maybe_coerce_merge_keys(self)
    976             # incompatible dtypes GH 9780, GH 15800
    977             elif is_numeric_dtype(lk) and not is_numeric_dtype(rk):
--> 978                 raise ValueError(msg)
    979             elif not is_numeric_dtype(lk) and is_numeric_dtype(rk):
    980                 raise ValueError(msg)

ValueError: You are trying to merge on bool and object columns. If you wish to proceed you should use pd.concat

Expected Output

       x
0   True
1  False

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.49-moby
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: 3.4.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@gfyoung gfyoung added the Bug label May 21, 2018
@gfyoung
Copy link
Member

gfyoung commented May 21, 2018

@crepererum : Thanks for reporting this! Was this working on a previous version by any chance?

@crepererum
Copy link
Contributor Author

Yes, the expected output was produced by 0.22.0.

Does pandas, by any chance, have some infrastructure for bisecting the commit / merge that introduced the issue?

@gfyoung
Copy link
Member

gfyoung commented May 21, 2018

git bisect is what you're looking for. That's a pure git construct BTW.

@gfyoung gfyoung added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Regression Functionality that used to work in a prior pandas version Indexing Related to indexing on series/frames, not to indexes themselves and removed Bug labels May 21, 2018
@crepererum
Copy link
Contributor Author

I know GIT does the job, I just thought there may be already be some scripts to simplify the build + test process or even an archive of pre built wheels for every master commit. Anyway, no worries, I'll see what if I find the the commit this week.

@crepererum
Copy link
Contributor Author

1355df64af894836f86cbc75b0c3b1435331aeb4 is the first bad commit
commit 1355df64af894836f86cbc75b0c3b1435331aeb4
Author: Paul Reidy <[email protected]>
Date:   Sun Dec 10 15:41:36 2017 +0000

    ERR: ValueError when merging on incompatible dtypes (#18674)

@jorisvandenbossche jorisvandenbossche added this to the 0.23.1 milestone May 23, 2018
@crepererum
Copy link
Contributor Author

The problem here is: the check introduced is in this commit does the correct thing. The index dtype is wrong (it's object, not bool), which can also be shown be this simple example (identical result for 0.22.0 and 0.23.0):

>>> pd.Index([True, False], dtype=bool)
Index([True, False], dtype='object')

Or in other words: the index dtype is wrong in both versions, the check that was introduced in-between just makes the problem visible.

@jorisvandenbossche
Copy link
Member

@crepererum Thanks for figuring that out.
So the underlying cause (which now comes up due to that change), is that boolean values in the index are stored as object array (we don't have a BooleanIndex), so it things the types are not compatible (object vs bool).

So we will have to 'infer' the dtype of the index to cover those cases.

@jorisvandenbossche
Copy link
Member

Yes, fully correct, we've come to the same conclusion :-)
The index dtype being object is not really "wrong", it is just how boolean values are currently stored in indexes. So the code to check if the dtypes are compatible in the merging code, needs to be made aware of this fact.

@crepererum
Copy link
Contributor Author

BTW: the problem is NOT specific to indices. This code also doesn't work:

import pandas as pd

df1 = pd.DataFrame({'x': [True, False]})
df2 = pd.DataFrame({'x': [True, False, None]})
df1.merge(df2)

This is especially bad because pandas uses None as NULL-placeholder in bool-columns.

@jorisvandenbossche
Copy link
Member

@crepererum yes, that's indeed also a reason for having object boolean columns.
We will fix this in the upcooming 0.23.1 release, I started a PR at #21310

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants