BUG: unexpected merge_ordered results caused by wrongly groupby #38170

Merged
merged 5 commits into from
Dec 2, 2020
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -756,6 +756,8 @@ Reshaping
 - Bug in :func:`concat` resulting in a ``ValueError`` when at least one of both inputs had a non-unique index (:issue:`36263`)
 - Bug in :meth:`DataFrame.merge` and :meth:`pandas.merge` returning inconsistent ordering in result for ``how=right`` and ``how=left`` (:issue:`35382`)
 - Bug in :func:`merge_ordered` couldn't handle list-like ``left_by`` or ``right_by`` (:issue:`35269`)
+- Bug in :func:`merge_ordered` returned wrong join result when length of ``left_by`` or ``right_by`` equals to the rows of ``left`` or ``right`` (:issue:`38166`)
+- Bug in :func:`merge_ordered` didn't raise when elements in ``left_by`` or ``right_by`` not exist in ``left`` columns or ``right`` columns (:issue:`38167`)
 
 Sparse
 ^^^^^^
15 changes: 11 additions & 4 deletions pandas/core/reshape/merge.py
@@ -114,11 +114,8 @@ def _groupby_and_merge(by, on, left: "DataFrame", right: "DataFrame", merge_piec
 
     # if we can groupby the rhs
     # then we can get vastly better perf
-
-    try:
+    if all(item in right.columns for item in by):
         rby = right.groupby(by, sort=False)
-    except KeyError:
-        pass
 
     for key, lhs in lby:
 
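The hunk above replaces exception-driven detection with an explicit membership test, so whether `right` is grouped no longer depends on `groupby` raising `KeyError`. A minimal sketch of that guard in isolation, reusing the `right` frame from this PR's tests; the variable name `can_group_right` is an illustrative assumption and does not appear in pandas:

```python
import pandas as pd

right = pd.DataFrame([[2, 1]], columns=list("TE"))
by = ["G", "H"]  # grouping keys that belong to the other frame

# Group `right` only when every key is one of its column labels;
# otherwise leave `rby` unset, as the patched code does.
can_group_right = all(item in right.columns for item in by)
rby = right.groupby(by, sort=False) if can_group_right else None

print(can_group_right)  # False: "G" and "H" are not columns of `right`
```

With `by = ["T"]` the same test passes and `right` would be grouped.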
@@ -274,10 +271,20 @@ def _merger(x, y):
     if left_by is not None and right_by is not None:
         raise ValueError("Can only group either left or right frames")
     elif left_by is not None:
+        if isinstance(left_by, str):

Contributor

It's possible that we want to integrate these checks much more closely into: https://github.com/pandas-dev/pandas/blob/master/pandas/core/reshape/merge.py#L1247

but this may require some refactoring (as a follow-up, if you can).

Contributor Author

I will give it a try.

+            left_by = [left_by]
+        check = set(left_by).difference(left.columns)
+        if len(check) != 0:
+            raise KeyError(f"{check} not found in left columns")
         result, _ = _groupby_and_merge(
             left_by, on, left, right, lambda x, y: _merger(x, y)
         )
     elif right_by is not None:
+        if isinstance(right_by, str):
+            right_by = [right_by]
+        check = set(right_by).difference(right.columns)
+        if len(check) != 0:
+            raise KeyError(f"{check} not found in right columns")
         result, _ = _groupby_and_merge(
             right_by, on, right, left, lambda x, y: _merger(y, x)
         )
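The `left_by` and `right_by` branches above apply the same two steps: wrap a bare string in a list, then raise if any key is missing from the frame's columns. As a rough sketch of how that shared logic could be factored out, here is a standalone helper. `normalize_by` and its `side` parameter are hypothetical names used only for illustration; this is not the refactoring the reviewer had in mind, nor part of the pandas API:

```python
from typing import Iterable, List, Union


def normalize_by(
    by: Union[str, Iterable[str]], columns: Iterable[str], side: str
) -> List[str]:
    """Hypothetical helper mirroring the checks added in this diff."""
    # A single column name becomes a one-element list.
    by_list = [by] if isinstance(by, str) else list(by)
    # Any key that is not a column label is reported, as in the patch.
    missing = set(by_list).difference(columns)
    if missing:
        raise KeyError(f"{missing} not found in {side} columns")
    return by_list


print(normalize_by("G", ["G", "H", "T"], "left"))  # ['G']

try:
    normalize_by(["G", "h"], ["G", "H", "T"], "left")
except KeyError as err:
    print(err.args[0])  # {'h'} not found in left columns
```

The helper takes plain column labels rather than a DataFrame so that it runs without pandas; inside `merge_ordered` the caller would pass `left.columns` or `right.columns`.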
19 changes: 19 additions & 0 deletions pandas/tests/reshape/merge/test_merge_ordered.py
@@ -177,3 +177,22 @@ def test_list_type_by(self, left, right, on, left_by, right_by, expected):
         )
 
         tm.assert_frame_equal(result, expected)
+
+    def test_left_by_length_equals_to_right_shape0(self):
+        # GH 38166
+        left = DataFrame([["g", "h", 1], ["g", "h", 3]], columns=list("GHT"))
+        right = DataFrame([[2, 1]], columns=list("TE"))
+        result = merge_ordered(left, right, on="T", left_by=["G", "H"])
+        expected = DataFrame(
+            {"G": ["g"] * 3, "H": ["h"] * 3, "T": [1, 2, 3], "E": [np.nan, 1.0, np.nan]}
+        )
+
+        tm.assert_frame_equal(result, expected)
+
+    def test_elements_not_in_by_but_in_df(self):
+        # GH 38167
+        left = DataFrame([["g", "h", 1], ["g", "h", 3]], columns=list("GHT"))
+        right = DataFrame([[2, 1]], columns=list("TE"))
+        msg = r"\{'h'\} not found in left columns"
+        with pytest.raises(KeyError, match=msg):
+            merge_ordered(left, right, on="T", left_by=["G", "h"])