REGR: allow merging on object boolean columns #21310

Merged
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.23.1.txt
@@ -31,7 +31,7 @@ Fixed Regressions
- Bug in :meth:`Categorical.fillna` incorrectly raising a ``TypeError`` when `value` the individual categories are iterable and `value` is an iterable (:issue:`21097`, :issue:`19788`)
- Regression in :func:`pivot_table` where an ordered ``Categorical`` with missing
values for the pivot's ``index`` would give a mis-aligned result (:issue:`21133`)

- Fixed regression in merging on boolean index/columns (:issue:`21119`).

.. _whatsnew_0231.performance:

10 changes: 8 additions & 2 deletions pandas/core/reshape/merge.py
@@ -28,6 +28,7 @@
     is_int_or_datetime_dtype,
     is_dtype_equal,
     is_bool,
+    is_bool_dtype,
     is_list_like,
     is_datetimelike,
     _ensure_int64,
@@ -974,9 +975,14 @@ def _maybe_coerce_merge_keys(self):

             # Check if we are trying to merge on obviously
             # incompatible dtypes GH 9780, GH 15800
-            elif is_numeric_dtype(lk) and not is_numeric_dtype(rk):
+
+            # boolean values are considered as numeric, but are still allowed
+            # to be merged on object boolean values
+            elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk))
+                    and not is_numeric_dtype(rk)):
                 raise ValueError(msg)
-            elif not is_numeric_dtype(lk) and is_numeric_dtype(rk):
+            elif (not is_numeric_dtype(lk)
+                    and (is_numeric_dtype(rk) and not is_bool_dtype(rk))):
                 raise ValueError(msg)
             elif is_datetimelike(lk) and not is_datetimelike(rk):
                 raise ValueError(msg)
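
Note on the hunk above: the two patched conditions can be read as one predicate, "exactly one side is numeric, and the numeric side is not boolean". The following is a minimal sketch of that reading, not the pandas implementation. `numeric_mismatch` is a hypothetical helper name, and the sample arrays are assumptions chosen for illustration.

```python
import numpy as np
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def numeric_mismatch(lk, rk):
    """Hypothetical helper mirroring the two patched ``elif`` conditions."""
    # A key is "strictly numeric" when it is numeric but not boolean.
    lk_strict = is_numeric_dtype(lk) and not is_bool_dtype(lk)
    rk_strict = is_numeric_dtype(rk) and not is_bool_dtype(rk)
    return (lk_strict and not is_numeric_dtype(rk)) or (
        not is_numeric_dtype(lk) and rk_strict
    )


# Illustrative keys (assumed data, not taken from the PR).
bool_key = np.array([True, False])                   # dtype bool
object_key = np.array([True, False], dtype=object)   # object-dtype booleans
int_key = np.array([0, 1])                           # integer dtype

bool_vs_object = numeric_mismatch(bool_key, object_key)
object_vs_bool = numeric_mismatch(object_key, bool_key)
int_vs_object = numeric_mismatch(int_key, object_key)
object_vs_int = numeric_mismatch(object_key, int_key)
int_vs_bool = numeric_mismatch(int_key, bool_key)
```

Under this predicate, bool against object is no longer flagged in either direction, while int against object still is. Int against bool is not flagged at all, because pandas reports both dtypes as numeric.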
23 changes: 23 additions & 0 deletions pandas/tests/reshape/merge/test_merge.py
@@ -1526,6 +1526,27 @@ def test_merge_on_ints_floats_warning(self):
result = B.merge(A, left_on='Y', right_on='X')
assert_frame_equal(result, expected[['Y', 'X']])

def test_merge_incompat_infer_boolean_object(self):
# GH21119: bool + object bool merge OK
df1 = DataFrame({'key': Series([True, False], dtype=object)})
df2 = DataFrame({'key': [True, False]})
Contributor

You need a NaN in the object dtype as well. This should test inferred bool and bool dtype on both sides (with and without NaN).


expected = DataFrame({'key': [True, False]}, dtype=object)
result = pd.merge(df1, df2, on='key')
Contributor

Need to test the error case, e.g. bool / inferred bool on one side and numeric on the other (I think this will raise). There is probably an existing test; just add to it.

assert_frame_equal(result, expected)
result = pd.merge(df2, df1, on='key')
Contributor

Add blank lines between cases.

assert_frame_equal(result, expected)
Contributor

Ideally use the `how` fixture.

Member Author

The `how` fixture is also not used by the other tests for this, so I am leaving it for now.


# with missing value
df1 = DataFrame({'key': Series([True, False, np.nan], dtype=object)})
df2 = DataFrame({'key': [True, False]})

expected = DataFrame({'key': [True, False]}, dtype=object)
result = pd.merge(df1, df2, on='key')
assert_frame_equal(result, expected)
result = pd.merge(df2, df1, on='key')
assert_frame_equal(result, expected)
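
Note on the test above: the two cases (object-dtype booleans with and without a missing value) and the two merge directions can be walked in one loop. This is a rough standalone sketch rather than the test as committed. The names `bool_frame`, `object_frames`, and `inner_keys` are assumptions for illustration, it assumes a pandas version that includes this fix, and it only covers the default inner merge, not the other `how` values raised in review.

```python
import numpy as np
import pandas as pd

# Frame with a real bool-dtype key.
bool_frame = pd.DataFrame({"key": [True, False]})

# Frames whose key holds booleans in an object-dtype column.
object_frames = {
    "without NaN": pd.DataFrame(
        {"key": pd.Series([True, False], dtype=object)}),
    "with NaN": pd.DataFrame(
        {"key": pd.Series([True, False, np.nan], dtype=object)}),
}

# Collect the matched keys for each case, in both merge directions.
inner_keys = {}
for label, object_frame in object_frames.items():
    pairs = [(object_frame, bool_frame), (bool_frame, object_frame)]
    for left, right in pairs:
        result = pd.merge(left, right, on="key", how="inner")
        inner_keys.setdefault(label, []).append(sorted(result["key"]))
```

In each case only `True` and `False` should match, so the NaN row drops out of the inner merge.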

@pytest.mark.parametrize('df1_vals, df2_vals', [
([0, 1, 2], ["0", "1", "2"]),
([0.0, 1.0, 2.0], ["0", "1", "2"]),
@@ -1538,6 +1559,8 @@ def test_merge_on_ints_floats_warning(self):
pd.date_range('20130101', periods=3, tz='US/Eastern')),
([0, 1, 2], Series(['a', 'b', 'a']).astype('category')),
([0.0, 1.0, 2.0], Series(['a', 'b', 'a']).astype('category')),
# TODO ([0, 1], pd.Series([False, True], dtype=bool)),
([0, 1], pd.Series([False, True], dtype=object))
])
def test_merge_incompat_dtypes(self, df1_vals, df2_vals):
# GH 9780, GH 15800
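
Note on the parametrized cases above: a quick way to see which key combinations are rejected is to probe them directly. The sketch below is not part of the PR. `merge_outcome` is a hypothetical helper, and the column name `A` is an assumption. The int against object-bool combination is recorded but deliberately not asserted, since how pandas treats it may depend on the installed version.

```python
import pandas as pd


def merge_outcome(left_vals, right_vals):
    """Hypothetical probe: 'ok' if the merge runs, 'ValueError' if rejected."""
    left = pd.DataFrame({"A": left_vals})
    right = pd.DataFrame({"A": right_vals})
    try:
        pd.merge(left, right, on="A")
    except ValueError:
        return "ValueError"
    return "ok"


outcomes = {
    # Numeric against strings: one of the incompatible pairs listed above.
    "int vs str": merge_outcome([0, 1, 2], ["0", "1", "2"]),
    # The combination this PR allows again.
    "bool vs object bool": merge_outcome(
        [True, False], pd.Series([True, False], dtype=object)),
    # Recorded only; the result is an assumption-free observation, not a claim.
    "int vs object bool": merge_outcome(
        [0, 1], pd.Series([False, True], dtype=object)),
}
```

Printing `outcomes` on the pandas version under test shows at a glance whether the reviewer's error case raises there.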