ENH: Implement cross method for Merge Operations #37864

Merged
merged 30 commits into from
Nov 26, 2020
Changes from 14 commits
71edcce
First cross merge draft for merge operation
phofl Nov 15, 2020
cc5d779
Merge branch 'master' of https://github.com/pandas-dev/pandas
phofl Nov 15, 2020
f573ca4
Fix variable assignment
phofl Nov 15, 2020
0acdd00
Adress review comments
phofl Nov 18, 2020
949185e
Change function signature
phofl Nov 18, 2020
2d5ccaa
Add cross functionality for join
phofl Nov 18, 2020
9098852
Change docs
phofl Nov 18, 2020
60f6b25
Add asvs
phofl Nov 18, 2020
0601243
Move import
phofl Nov 18, 2020
c46eab3
Assign value
phofl Nov 18, 2020
6274120
Reduce asvs
phofl Nov 18, 2020
891785d
Merge branch 'master' of https://github.com/pandas-dev/pandas into 5401
phofl Nov 18, 2020
1f0a1c8
Remove whitespaces
phofl Nov 18, 2020
d7c1156
Move import
phofl Nov 18, 2020
7c8d37a
Adress review
phofl Nov 19, 2020
651540e
Fix doc checks
phofl Nov 19, 2020
f7cdd4d
Add docstring
phofl Nov 19, 2020
a4d24a9
Change example
phofl Nov 19, 2020
741b4b7
Fix typos and rename variables
phofl Nov 21, 2020
0ff78fc
Check unmodified inputs
phofl Nov 21, 2020
94316f3
Add examples
phofl Nov 22, 2020
94b1367
Add tests
phofl Nov 22, 2020
77a9e23
Fix doc
phofl Nov 22, 2020
4597642
Change signature
phofl Nov 23, 2020
67a67a6
Move test
phofl Nov 23, 2020
f731081
Delete import
phofl Nov 23, 2020
4fcde78
Create new file
phofl Nov 23, 2020
4589651
Raise if duplicate on column
phofl Nov 24, 2020
d964ef1
Revert "Raise if duplicate on column"
phofl Nov 25, 2020
a1eeaa4
Merge branch 'master' of https://github.com/pandas-dev/pandas into 5401
phofl Nov 25, 2020
6 changes: 6 additions & 0 deletions asv_bench/benchmarks/join_merge.py
@@ -132,6 +132,9 @@ def time_join_dataframe_index_single_key_small(self, sort):
def time_join_dataframe_index_shuffle_key_bigger_sort(self, sort):
self.df_shuf.join(self.df_key2, on="key2", sort=sort)

def time_join_dataframes_cross(self):
self.df.loc[:2000].join(self.df_key1, how="cross")


class JoinIndex:
def setup(self):
@@ -205,6 +208,9 @@ def time_merge_dataframe_integer_2key(self, sort):
def time_merge_dataframe_integer_key(self, sort):
merge(self.df, self.df2, on="key1", sort=sort)

def time_merge_dataframes_cross(self):
merge(self.left.loc[:2000], self.right.loc[:2000], how="cross")


class I8Merge:
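
Both new benchmarks slice their inputs with `.loc[:2000]` before the cross join, since the output of a cross join has `len(left) * len(right)` rows. A minimal sketch of that growth, assuming a default `RangeIndex` (where `.loc[:n]` is label-based and includes the endpoint) and pandas >= 1.2.0; the frames below are illustrative and are not the benchmark fixtures:

```python
import pandas as pd

# Illustrative frames only; the real benchmark fixtures are larger.
left = pd.DataFrame({"x": range(10)})
right = pd.DataFrame({"y": range(7)})

# .loc slicing is label-based and inclusive, so [:3] keeps labels 0..3 (4 rows).
left_small = left.loc[:3]
right_small = right.loc[:2]

result = left_small.merge(right_small, how="cross")

# The row count is the product of the input lengths: 4 * 3.
print(len(left_small), len(right_small), len(result))
```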

1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -252,6 +252,7 @@ Other enhancements
- Improve error reporting for :meth:`DataFrame.merge()` when invalid merge column definitions were given (:issue:`16228`)
- Improve numerical stability for :meth:`Rolling.skew()`, :meth:`Rolling.kurt()`, :meth:`Expanding.skew()` and :meth:`Expanding.kurt()` through implementation of Kahan summation (:issue:`6929`)
- Improved error reporting for subsetting columns of a :class:`DataFrameGroupBy` with ``axis=1`` (:issue:`37725`)
- Implement method ``cross`` for :meth:`DataFrame.merge` and :meth:`DataFrame.join` (:issue:`5401`)

.. ---------------------------------------------------------------------------
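
For reference, a minimal usage sketch of the enhancement described in the whatsnew entry, assuming pandas >= 1.2.0. The frames mirror the ones used in this PR's tests; no `on`, `left_on`, `right_on` or index flags are passed:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 3]})
right = pd.DataFrame({"b": [3, 4]})

# Every row of `left` is paired with every row of `right`,
# keeping the order of the left rows.
result = pd.merge(left, right, how="cross")
print(result)
```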

18 changes: 17 additions & 1 deletion pandas/core/frame.py
@@ -205,12 +205,14 @@
The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.
When performing a cross merge, no column specifications to merge on are
allowed.

Parameters
----------%s
right : DataFrame or named Series
Object to merge with.
-how : {'left', 'right', 'outer', 'inner'}, default 'inner'
+how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
Type of merge to be performed.

* left: use only keys from left frame, similar to a SQL left outer join;
@@ -221,6 +223,11 @@
join; sort keys lexicographically.
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
* cross: creates the karthesian product from both frames, preserves the order

Contributor: cartesian

Member Author: German influence, sorry :)

of the left keys.

.. versionadded:: 1.2.0

on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If `on` is None and not merging on indexes then this defaults
@@ -8065,6 +8072,15 @@ def _join_compat(
other = DataFrame({other.name: other})

if isinstance(other, DataFrame):
if how == "cross":
return merge(
self,
other,
how=how,
on=on,
suffixes=(lsuffix, rsuffix),
sort=sort,
)
return merge(
self,
other,
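
The `_join_compat` branch above forwards `lsuffix` and `rsuffix` to `merge` as `suffixes`. A short sketch of what that means for a caller, assuming pandas >= 1.2.0 and using the same frames as `test_join_cross`:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 3]})
right = pd.DataFrame({"a": [3, 4]})

# Both frames have a column "a", so the suffixes disambiguate the result columns.
result = left.join(right, how="cross", lsuffix="_x", rsuffix="_y")
print(list(result.columns))
```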
46 changes: 44 additions & 2 deletions pandas/core/reshape/merge.py
@@ -5,6 +5,7 @@
import copy
import datetime
from functools import partial
import hashlib
import string
from typing import TYPE_CHECKING, Optional, Tuple, cast
import warnings
@@ -643,6 +644,18 @@ def __init__(

self._validate_specification()

if self.how == "cross":
(
self.left,
self.right,
self.how,
cross_col,
) = self._create_cross_configuration(self.left, self.right)
self.left_on = self.right_on = [cross_col]
self._cross = cross_col
else:
self._cross = None

# note this function has side effects
(
self.left_join_keys,
@@ -690,8 +703,14 @@ def get_result(self):

self._maybe_restore_index_levels(result)

self._maybe_drop_cross_column(result, self._cross)

return result.__finalize__(self, method="merge")

def _maybe_drop_cross_column(self, result: "DataFrame", cross_col: str):
if cross_col is not None:
result.drop(columns=cross_col, inplace=True)

def _indicator_pre_merge(
self, left: "DataFrame", right: "DataFrame"
) -> Tuple["DataFrame", "DataFrame"]:
@@ -1200,9 +1219,32 @@ def _maybe_coerce_merge_keys(self):
typ = rk.categories.dtype if rk_is_cat else object
self.right = self.right.assign(**{name: self.right[name].astype(typ)})

def _create_cross_configuration(
self, _left, _right

Contributor: name these left, right

Member Author: Done

) -> Tuple["DataFrame", "DataFrame", str, str]:
cross_col = f"_cross_{hashlib.md5().hexdigest()}"
how = "inner"
return (
_left.assign(**{cross_col: 1}),
_right.assign(**{cross_col: 1}),
how,
cross_col,
)

def _validate_specification(self):
if self.how == "cross":
if (
self.left_index
or self.right_index
or self.right_on is not None
or self.left_on is not None
or self.on is not None
):
raise MergeError(
"Can not pass any merge columns when using cross as merge method"

Contributor: can you say that left_on,right_on,on must be None, and left_index,right_index must be False

Member Author: Thx, done

)
# Hm, any way to make this logic less complicated??
-if self.on is None and self.left_on is None and self.right_on is None:
+elif self.on is None and self.left_on is None and self.right_on is None:

if self.left_index and self.right_index:
self.left_on, self.right_on = (), ()
@@ -1266,7 +1308,7 @@ def _validate_specification(self):
'of levels in the index of "left"'
)
self.left_on = [None] * n
-if len(self.right_on) != len(self.left_on):
+if self.how != "cross" and len(self.right_on) != len(self.left_on):
raise ValueError("len(right_on) must equal len(left_on)")

def _validate(self, validate: str):
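
The changes in this file implement the cross merge by assigning a constant key column to both frames, running an inner merge on it, and dropping the column from the result. A rough sketch of the same idea using only the public API follows. The helper name `cross_via_dummy_key` is hypothetical, and generating the temporary column name with `uuid4` is this sketch's assumption (the diff builds the name from `hashlib.md5`):

```python
import uuid

import pandas as pd


def cross_via_dummy_key(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: emulate how="cross" with a constant join key."""
    # Temporary column name; uuid4 is this sketch's choice, not the PR's.
    key = f"_cross_{uuid.uuid4().hex}"
    merged = pd.merge(
        left.assign(**{key: 1}),  # assign returns a copy, so the inputs stay unmodified
        right.assign(**{key: 1}),
        how="inner",
        on=key,
    )
    # Remove the helper column so only the original columns remain.
    return merged.drop(columns=key)


left = pd.DataFrame({"a": [1, 3]})
right = pd.DataFrame({"b": [3, 4]})
print(cross_via_dummy_key(left, right))
```

Because every row carries the same key value, the inner merge matches each left row with each right row, which is the Cartesian product.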
23 changes: 23 additions & 0 deletions pandas/tests/reshape/merge/test_join.py
@@ -1,6 +1,8 @@
import numpy as np
import pytest

from pandas.errors import MergeError

import pandas as pd
from pandas import DataFrame, Index, MultiIndex, Series, concat, merge
import pandas._testing as tm
@@ -803,3 +805,24 @@ def test_join_inner_multiindex_deterministic_order():
index=MultiIndex.from_tuples([(2, 1, 4, 3)], names=("b", "a", "d", "c")),
)
tm.assert_frame_equal(result, expected)


Contributor: can you move to test_cross_merge (same dir)

Member Author: If I understand you correctly, it is directly below the other cross tests


@pytest.mark.parametrize(
("input_col", "output_cols"), [("b", ["a", "b"]), ("a", ["a_x", "a_y"])]
)
def test_join_cross(input_col, output_cols):
# GH#5401
left = DataFrame({"a": [1, 3]})
right = DataFrame({input_col: [3, 4]})
result = left.join(right, how="cross", lsuffix="_x", rsuffix="_y")
expected = DataFrame({output_cols[0]: [1, 1, 3, 3], output_cols[1]: [3, 4, 3, 4]})
tm.assert_frame_equal(result, expected)


def test_join_cross_error_reporting():
# GH#5401
left = DataFrame({"a": [1, 3]})
right = DataFrame({"a": [3, 4]})
msg = "Can not pass any merge columns when using cross as merge method"
with pytest.raises(MergeError, match=msg):
left.join(right, how="cross", on="a")
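
Seen from the caller's side, the error path exercised by the test above looks roughly like the sketch below, assuming pandas >= 1.2.0. Only the exception type is checked, since the message text was reworded during review and may differ between versions:

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"a": [1, 3]})
right = pd.DataFrame({"a": [3, 4]})

errors = []

# Passing a join column together with how="cross" is rejected for join ...
try:
    left.join(right, how="cross", on="a")
except MergeError:
    errors.append("join")

# ... and for merge, including when an index flag is set instead of a column.
try:
    pd.merge(left, right, how="cross", left_index=True)
except MergeError:
    errors.append("merge")

print(errors)
```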
31 changes: 31 additions & 0 deletions pandas/tests/reshape/merge/test_merge.py
@@ -2337,3 +2337,34 @@ def test_merge_join_cols_error_reporting_on_and_index(func, kwargs):
)
with pytest.raises(MergeError, match=msg):
getattr(pd, func)(left, right, on="a", **kwargs)


@pytest.mark.parametrize(
("input_col", "output_cols"), [("b", ["a", "b"]), ("a", ["a_x", "a_y"])]

Contributor: sorry maybe i wasn't clear, can you make a new file called test_merge_cross.py and put these there.

Member Author: Aaah ok, created the file and moved the tests

)
def test_merge_cross(input_col, output_cols):
# GH#5401
left = DataFrame({"a": [1, 3]})
right = DataFrame({input_col: [3, 4]})
result = merge(left, right, how="cross")
expected = DataFrame({output_cols[0]: [1, 1, 3, 3], output_cols[1]: [3, 4, 3, 4]})
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
"kwargs",
[
{"left_index": True},
{"right_index": True},
{"on": "a"},
{"left_on": "a"},
{"right_on": "b"},
],
)
def test_merge_cross_error_reporting(kwargs):
# GH#5401
left = DataFrame({"a": [1, 3]})
right = DataFrame({"b": [3, 4]})
msg = "Can not pass any merge columns when using cross as merge method"
with pytest.raises(MergeError, match=msg):
merge(left, right, how="cross", **kwargs)