ENH: anti joins #49328

Closed
wants to merge 57 commits into from
Changes from 45 commits
Commits
fba0bb5
ENH:included anti join functionality
debnathshoham Aug 15, 2021
53fd41d
included multicol join
debnathshoham Aug 16, 2021
627c31e
Merge branch 'master' into 42916
debnathshoham Aug 16, 2021
448373b
handling index and col
debnathshoham Aug 17, 2021
4dd802d
added test on nan
debnathshoham Aug 17, 2021
3b17c59
Merge branch 'master' into 42916
debnathshoham Aug 17, 2021
6e3d1a4
removed tests cases with warning
debnathshoham Aug 17, 2021
6427f09
seperated antijoin tests to another file
debnathshoham Aug 18, 2021
86ddac9
replaced np funcs with pd
debnathshoham Aug 18, 2021
84294e4
added test with pd.NA
debnathshoham Aug 18, 2021
43ae0a1
suggested changes
debnathshoham Aug 19, 2021
c36705c
added tests covering Categorcal, EA, datetime,datetime w tz, EA+multicol
debnathshoham Aug 19, 2021
951406a
Update pandas/tests/reshape/merge/test_merge_anti.py
debnathshoham Aug 20, 2021
db80abf
formatted with black
debnathshoham Aug 20, 2021
d93c0ac
changed a few test setup
debnathshoham Aug 20, 2021
90af576
Merge branch 'master' into 42916
debnathshoham Sep 8, 2021
80ce02e
Merge branch 'master' into 42916
debnathshoham Sep 10, 2021
79bbbb9
removed object cast for EA dtypes; xref #43152
debnathshoham Sep 10, 2021
ee7cc16
Merge branch 'master' into 42916
debnathshoham Sep 10, 2021
76cd5c6
Merge branch 'master' into 42916
debnathshoham Sep 19, 2021
d358efc
Merge branch 'master' into 42916
debnathshoham Sep 24, 2021
14d0d4c
more comments
debnathshoham Sep 24, 2021
3fe64f4
added in merge.rst
debnathshoham Sep 24, 2021
fc50027
removed comments from example; failing doctests
debnathshoham Sep 25, 2021
aba9a30
reversed mm dd order in test_anti_datetime_tz to prevent UserWarning
debnathshoham Sep 25, 2021
411bcaa
Update pandas/core/reshape/merge.py
debnathshoham Sep 26, 2021
417ea13
Delete out.csv
debnathshoham Sep 26, 2021
09426c6
Revert "Update pandas/core/reshape/merge.py"
debnathshoham Sep 26, 2021
b6e72aa
removed files added by mistake
debnathshoham Sep 26, 2021
8338a48
more comments on tests
debnathshoham Sep 26, 2021
cc6c8ea
Merge branch 'master' into 42916
debnathshoham Sep 27, 2021
f33fe48
Merge branch 'master' into 42916
debnathshoham Sep 29, 2021
74e172b
Merge branch 'master' into 42916
debnathshoham Oct 4, 2021
594f80a
Merge branch 'master' into 42916
debnathshoham Dec 15, 2021
f395cbb
Merge branch 'main' of https://github.com/pandas-dev/pandas into 42916
debnathshoham Jan 14, 2022
bf76fda
attempt 2
debnathshoham Oct 25, 2022
32403f8
Merge branch 'main' of https://github.com/pandas-dev/pandas into 42916v2
debnathshoham Oct 25, 2022
8b9a8e5
Merge branch 'pandas-dev:main' into 42916v2
debnathshoham Oct 25, 2022
a639f3d
Merge branch '42916v2' of https://github.com/debnathshoham/pandas int…
debnathshoham Oct 25, 2022
93150b7
fixed broken tests
debnathshoham Oct 26, 2022
f63c903
removed prev implementation
debnathshoham Oct 26, 2022
9a00186
fix lint
debnathshoham Oct 26, 2022
e021a0c
added whatsnew
debnathshoham Oct 26, 2022
50ffb40
Merge branch 'main' into 42916v2
debnathshoham Oct 26, 2022
90639aa
Merge branch 'main' of https://github.com/pandas-dev/pandas into 42916v2
debnathshoham Oct 26, 2022
9393ed6
Merge branch 'main' into 42916v2
debnathshoham Oct 26, 2022
378c858
updated version
debnathshoham Oct 27, 2022
645b02d
remove unnecessary typing
debnathshoham Oct 27, 2022
139881d
Merge branch '42916v2' of https://github.com/debnathshoham/pandas int…
debnathshoham Oct 27, 2022
e2a9423
added asv benchmark for anti_joins
debnathshoham Oct 27, 2022
27f87ab
Merge branch 'main' of https://github.com/pandas-dev/pandas into 42916v2
debnathshoham Oct 27, 2022
02158be
Merge branch 'main' into 42916v2
debnathshoham Oct 27, 2022
f79e600
Merge branch 'main' into 42916v2
debnathshoham Oct 31, 2022
d64d53a
Merge branch 'main' into 42916v2
debnathshoham Nov 8, 2022
1d50ba3
Merge branch 'main' into 42916v2
debnathshoham Nov 9, 2022
858fae5
Merge branch 'main' into 42916v2
debnathshoham Nov 11, 2022
ebad0d9
Merge branch 'main' into 42916v2
debnathshoham Dec 14, 2022
9 changes: 9 additions & 0 deletions doc/source/getting_started/comparison/includes/merge.rst
@@ -15,3 +15,12 @@ data does not have to be sorted ahead of time, and different join types are acco

outer_join = df1.merge(df2, on=["key"], how="outer")
outer_join

anti_left_join = df1.merge(df2, on=["key"], how="anti_left")
anti_left_join

anti_right_join = df1.merge(df2, on=["key"], how="anti_right")
anti_right_join

anti_full_join = df1.merge(df2, on=["key"], how="anti_full")
anti_full_join
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -49,6 +49,7 @@ Other enhancements
- Fix ``test`` optional_extra by adding missing test package ``pytest-asyncio`` (:issue:`48361`)
- :func:`DataFrame.astype` exception message thrown improved to include column name when type conversion is not possible. (:issue:`47571`)
- :meth:`DataFrame.to_json` now supports a ``mode`` keyword with supported inputs 'w' and 'a'. Defaulting to 'w', 'a' can be used when lines=True and orient='records' to append record oriented json lines to an existing json file. (:issue:`35849`)
- :meth:`DataFrame.join` now supports ``how`` with ``anti_left``, ``anti_right`` and ``anti_full`` (:issue:`42916`)

.. ---------------------------------------------------------------------------
.. _whatsnew_200.notable_bug_fixes:
39 changes: 38 additions & 1 deletion pandas/core/frame.py
@@ -295,7 +295,8 @@
----------%s
right : DataFrame or named Series
Object to merge with.
how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
how : {'left', 'right', 'outer', 'inner', 'cross', 'anti_left', \
'anti_right', 'anti_full'}, default 'inner'
Type of merge to be performed.

* left: use only keys from left frame, similar to a SQL left outer join;
@@ -310,6 +311,15 @@
of the left keys.

.. versionadded:: 1.2.0
* anti_left: use only keys from left frame that are absent in right
frame; preserve key order.
* anti_right: use keys from the right frame that are absent in the
left frame; preserve key order.
* anti_full: use keys from the right frame that are absent in the
left frame, and the keys in the left frame that are absent in the
right frame; sort keys lexicographically.

.. versionadded:: 1.4.0

on : label or list
Column or index level names to join on. These must be found in both
@@ -469,6 +479,33 @@
1 foo 8
2 bar 7
3 bar 8

>>> df1 = pd.DataFrame({"A": [1, 2, 3], "C": [5, 6, 7]})
>>> df2 = pd.DataFrame({"B": [1, 2, 4], "C": [7, 8, 9]})
>>> df1
A C
0 1 5
1 2 6
2 3 7
>>> df2
B C
0 1 7
1 2 8
2 4 9
>>> df1.merge(df2, on="C", how="anti_left")
A C B
0 1 5 NaN
Member

Is it common to still pull the columns from the right table in an anti join? These columns will just always be NA right?

Member Author

Thanks @WillAyd for the review.
That is true, but I am not sure how common it is. I took the example from a comment in the bug report.
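
For comparison, an anti join that keeps only the left frame's columns can already be written with a boolean mask. This is a minimal sketch using the docstring's `df1` and `df2`, not the behavior proposed in this PR; dplyr's `anti_join`, for example, returns only the columns of its left table.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "C": [5, 6, 7]})
df2 = pd.DataFrame({"B": [1, 2, 4], "C": [7, 8, 9]})

# Keep the rows of df1 whose key "C" has no match in df2.
# Only df1's columns survive, so there is no all-NaN "B" column.
anti_left = df1[~df1["C"].isin(df2["C"])]
print(anti_left)
```

With this form the result has columns `A` and `C` only, which sidesteps the question of what the right-hand columns should contain.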

1 2 6 NaN
>>> df1.merge(df2, on="C", how="anti_right")
A C B
0 NaN 8 2
1 NaN 9 4
>>> df1.merge(df2, on="C", how="anti_full")
A C B
0 1.0 5 NaN
1 2.0 6 NaN
2 NaN 8 2.0
3 NaN 9 4.0
"""


37 changes: 37 additions & 0 deletions pandas/core/reshape/merge.py
@@ -1015,6 +1015,43 @@ def _get_join_info(
left_ax = self.left.axes[self.axis]
right_ax = self.right.axes[self.axis]

if self.how in ["anti_left", "anti_right", "anti_full"]:
how_dict: dict[str, str] = {
"anti_left": "left",
"anti_right": "right",
"anti_full": "outer",
}
self.how = how_dict[self.how]
Member

Linking back to your other PR: there was concern there about the side effects of running this code. Can you do this without assigning back to self?

Member Author

In the previous PR, I had changed _MergeOperation so that the anti joins are converted into left/right/outer and the relevant changes are made to the initial dataframes.
But Jeff suggested in #43056 (review) that we should keep the validation and the merge operation separate (as is currently the case for the other joins).

The final result is the same as before. I think I tweaked a couple of test results, but I guess those were bugs that were present earlier.

Member Author

I think I missed this earlier: the side effect should not be in the validation. Here it is in the operation.
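
One way to avoid assigning back to `self.how` is to compute the base join type as a local value. The helper below is hypothetical (the name `resolve_how` and its return shape are assumptions for illustration, not part of pandas); only the mapping itself is taken from the diff.

```python
from typing import Tuple

# Mapping copied from the diff: each anti join reuses an existing join type.
_ANTI_TO_BASE = {
    "anti_left": "left",
    "anti_right": "right",
    "anti_full": "outer",
}


def resolve_how(how: str) -> Tuple[str, bool]:
    """Return (base_how, is_anti) without mutating any object state.

    Hypothetical helper: the caller keeps its own `how` attribute unchanged
    and passes the returned base_how on to the join call.
    """
    if how in _ANTI_TO_BASE:
        return _ANTI_TO_BASE[how], True
    return how, False


base_how, is_anti = resolve_how("anti_full")
print(base_how, is_anti)
```

A caller could then pass `base_how` to the index join and leave the original `how` intact for later inspection.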

if self.left_index and self.right_index:
# Merge using `right_index` and `left_index`
left_ax = self.left.index[~self.left.index.isin(self.right.index)]
Member

Curious if you've tried something closer to a left join with a filter on subsequent NA values compared to this. At first glance this seems like it could be a pretty expensive way of calculating this

Member Author

honestly I haven't tried that, I just went with #42916 (comment) from @attack68
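
The left-join-then-filter idea can be sketched with the existing `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. This is an illustration with the docstring frames, not the PR's implementation, and it says nothing about relative performance.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "C": [5, 6, 7]})
df2 = pd.DataFrame({"B": [1, 2, 4], "C": [7, 8, 9]})

# Ordinary left join; the _merge column records where each row came from.
merged = df1.merge(df2, on="C", how="left", indicator=True)

# Keep only rows that found no partner in df2, then drop the helper column.
anti_left = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(anti_left)
```

Filtering on `_merge` rather than on NaN in the right-hand columns avoids misclassifying matched rows whose right-hand values are genuinely missing.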

Member Author

For reference, below is the asv comparison

[ 50.00%] · For pandas commit 139881dd <42916v2> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.32-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· join_merge.I8Merge.time_i8merge    ok
[ 75.00%] ··· ============ ============
                  how                  
              ------------ ------------
                 inner      1.13±0.03s 
                 outer      1.12±0.02s 
                  left      1.19±0.01s 
                 right      1.24±0.05s 
               anti_left     567±20ms  
               anti_right    572±20ms  
               anti_full     577±30ms  
              ============ ============

Member

IIUC this is the same operation as np.setdiff1d. Generally, can you look at using the numpy functions here?

Member Author

Are you suggesting using this numpy function specifically here in _get_join_info? I will have to check how that works out with something like EA / Categorical dtypes.
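
One difference worth checking before swapping in `np.setdiff1d`: it returns the sorted unique values of the set difference, whereas the `isin` mask in the diff keeps duplicates and the original order. A small sketch with made-up index values:

```python
import numpy as np
import pandas as pd

left = pd.Index([3, 1, 3, 2])
right = pd.Index([2])

# Mask-based filtering, as in the diff: order and duplicates are preserved.
mask_based = left[~left.isin(right)]

# Set difference: values come back sorted and de-duplicated.
set_based = np.setdiff1d(left.to_numpy(), right.to_numpy())

print(mask_based.tolist())
print(set_based.tolist())
```

For a join, the mask form maps directly onto row positions, while the set form would need a second lookup to recover which rows to keep.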

right_ax = self.right.index[~self.right.index.isin(self.left.index)]
self.left = self.left.loc[left_ax]
self.right = self.right.loc[right_ax]
elif self.on is not None or (
self.left_on is not None and self.right_on is not None
):
# Merge using `on` or `left_on` and `right_on`
_left = [
Member

This looks a bit more complicated than what you had in the other PR - was that not working before?

Member

Can you address this comment?

Member Author

Hey, sorry. I thought the comment above covered this.

Yes, the previous PR was working, but the validation and the actual operation were not separate.

~Index(self.left_join_keys[x]).isin(Index(self.right_join_keys[x]))
for x in range(len(self.left_join_keys))
]
_right = [
~Index(self.right_join_keys[x]).isin(Index(self.left_join_keys[x]))
for x in range(len(self.left_join_keys))
]
self.left = self.left[np.sum(np.stack(_left, axis=0), axis=0) > 0]
self.right = self.right[np.sum(np.stack(_right, axis=0), axis=0) > 0]

self.left_join_keys = [
x[np.sum(np.stack(_left, axis=0), axis=0) > 0]
for x in self.left_join_keys
]
self.right_join_keys = [
x[np.sum(np.stack(_right, axis=0), axis=0) > 0]
for x in self.right_join_keys
]

if self.left_index and self.right_index and self.how != "asof":
join_index, left_indexer, right_indexer = left_ax.join(
right_ax, how=self.how, return_indexers=True, sort=self.sort
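
As far as the diff above can be read, the multi-key branch builds one `isin` mask per key column and keeps a row when any of those masks is true. That column-wise test can disagree with a test on whole key tuples. A minimal sketch with made-up frames, assuming two key columns `k1` and `k2`:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"k1": [1, 2], "k2": ["a", "b"]})
right = pd.DataFrame({"k1": [1, 2], "k2": ["b", "a"]})
keys = ["k1", "k2"]

# Column-wise: keep a row if any single key value is absent from the
# matching column of the other frame (the shape of the logic in the diff).
per_column = [(~left[k].isin(right[k])).to_numpy() for k in keys]
column_wise = np.any(np.stack(per_column, axis=0), axis=0)

# Row-wise: keep a row if its whole key tuple is absent from the other frame.
left_keys = pd.MultiIndex.from_frame(left[keys])
right_keys = pd.MultiIndex.from_frame(right[keys])
row_wise = ~left_keys.isin(right_keys)

print(column_wise.tolist())
print(row_wise.tolist())
```

Here every individual key value of `left` appears somewhere in the matching column of `right`, so the column-wise mask keeps nothing, even though neither `(1, "a")` nor `(2, "b")` occurs as a row in `right`.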