ENH: anti joins #49328

debnathshoham · 2022-10-26T08:22:17Z

closes ENH: Add anti-joins to pandas.merge #42916
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

WillAyd

Thanks for taking a stab at this. It's a pretty tricky change, so it will likely take a few rounds of review.

Could you also update the benchmarks in asv_bench/benchmarks/test_merge.py to parametrize these new options? It would be interesting to see how they compare.

WillAyd · 2022-10-26T23:01:10Z

pandas/core/frame.py

+2  4  9
+>>> df1.merge(df2, on="C", how="anti_left")
+   A  C   B
+0  1  5 NaN


Is it common to still pull the columns from the right table in an anti join? These columns will always be NA, right?

Thanks @WillAyd for the review.
That is true, but I am not sure. I had taken the example from a comment in the bug report.

WillAyd · 2022-10-26T23:04:10Z

pandas/core/reshape/merge.py

+            self.how = how_dict[self.how]
+            if self.left_index and self.right_index:
+                # Merge using `right_index` and `left_index`
+                left_ax = self.left.index[~self.left.index.isin(self.right.index)]


Curious if you've tried something closer to a left join with a filter on subsequent NA values compared to this. At first glance this seems like it could be a pretty expensive way of calculating this

Honestly, I haven't tried that; I just went with #42916 (comment) from @attack68.

For reference, below is the asv comparison

[ 50.00%] · For pandas commit 139881dd <42916v2> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.32-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· join_merge.I8Merge.time_i8merge                          ok
[ 75.00%] ··· ============ ============
                  how
              ------------ ------------
                 inner      1.13±0.03s
                 outer      1.12±0.02s
                  left      1.19±0.01s
                 right      1.24±0.05s
               anti_left     567±20ms
               anti_right    572±20ms
               anti_full     577±30ms
              ============ ============
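The alternative raised earlier in this thread, an ordinary join followed by a filter on the rows that found no match, could look roughly like the sketch below. It is not the PR's implementation: `anti_merge` is a hypothetical helper, the `anti_*` names are the `how` values proposed in this PR (released pandas does not accept them), and the filter uses the existing `indicator=True` option instead of testing for NA, since an NA test would misfire when the right frame already contains missing values.

```python
import pandas as pd

# Assumption: these names mirror the `how` values proposed in this PR.
_BASE_HOW = {"anti_left": "left", "anti_right": "right", "anti_full": "outer"}
_KEEP = {
    "anti_left": ["left_only"],
    "anti_right": ["right_only"],
    "anti_full": ["left_only", "right_only"],
}


def anti_merge(left, right, on, how="anti_left"):
    """Hypothetical helper: ordinary merge plus a filter on the indicator."""
    merged = left.merge(right, on=on, how=_BASE_HOW[how], indicator=True)
    # Keep only the rows that had no partner on the other side.
    unmatched = merged[merged["_merge"].isin(_KEEP[how])]
    return unmatched.drop(columns="_merge").reset_index(drop=True)


df1 = pd.DataFrame({"A": [1, 2], "C": [5, 6]})
df2 = pd.DataFrame({"B": [3, 4], "C": [6, 7]})
result = anti_merge(df1, df2, on="C")
```

With these two frames, `result` holds the single left row whose key `C=5` has no match, and its `B` column is missing, which is the shape shown in the docstring example above. Whether this is faster than the `isin` mask would need to be measured with the asv benchmark.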

IIUC this is the same operation as np.setdiff1d. Generally, can you look at using the NumPy functions here?

Are you suggesting using the NumPy function specifically here in _get_join_info? I will have to check how that works out with something like EA / Categorical dtypes.
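One difference worth keeping in mind when comparing the two approaches: `np.setdiff1d` returns the sorted, de-duplicated difference, whereas the boolean mask from the diff keeps the original order and any repeated keys. A small illustration with made-up index values:

```python
import numpy as np
import pandas as pd

left = pd.Index([3, 1, 3, 2])
right = pd.Index([2])

# Boolean-mask approach from the diff: order and duplicates are preserved.
masked = left[~left.isin(right)]

# np.setdiff1d: sorted unique values that are in `left` but not in `right`.
diffed = np.setdiff1d(left, right)
```

Here `masked` contains `3, 1, 3` while `diffed` contains `1, 3`, so the two only coincide when the left keys are already unique and sorted.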

WillAyd · 2022-10-26T23:06:50Z

pandas/tests/reshape/merge/test_merge_anti.py

+from pandas.core.reshape.merge import merge
+
+
+class Test_AntiJoin:


I think this can all be done in the existing test_merge.py file for now

I created a new file because there were already separate files for asof and cross.
Let me know if you feel strongly about this and I will move it into test_merge.py.

debnathshoham · 2022-10-27T07:02:11Z

Thanks @WillAyd for the review.
I think I have addressed most of the changes; let me know how this looks.

Cross-referencing #43056, a PR I opened for the same issue some time back. This PR is branched from the old one.

WillAyd

Ah OK, thanks for the info. FYI, in the future it would be easier for review if you pushed updates back to the original PR so we can maintain the comment history in one place.

WillAyd · 2022-10-27T15:06:27Z

pandas/core/reshape/merge.py

+                "anti_right": "right",
+                "anti_full": "outer",
+            }
+            self.how = how_dict[self.how]


Linking back to the other PR you had, there is concern about side effects of running this code. Can you do this without assignment back to self?

In the previous PR, I had changed _MergeOperation so that the anti joins are converted into left/right/outer joins and the relevant changes are made to the initial DataFrames.
But Jeff suggested in #43056 (review) that we should keep the validation and the merge operation separate (as is currently the case for the other joins).

The final result is the same as before. I think I tweaked a couple of test results, but I guess those were bugs that were present earlier.

I think I missed your point: the side effect should not be in the validation. Here it is in the operation.
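One way to avoid the assignment back to `self` discussed in this thread is to resolve the underlying join type on demand and leave the stored `how` untouched. The sketch below is a hypothetical stand-in class, not pandas' `_MergeOperation`; the `"anti_left": "left"` entry is inferred, since only the `anti_right` and `anti_full` entries are visible in the quoted hunk.

```python
# Assumption: "anti_left" -> "left" is inferred from the two visible entries.
_ANTI_TO_BASE = {"anti_left": "left", "anti_right": "right", "anti_full": "outer"}


class MergeOpSketch:
    """Hypothetical stand-in for the merge operation class; not pandas code."""

    def __init__(self, how):
        self.how = how  # stored once, never reassigned afterwards

    def join_how(self):
        # Resolve the underlying join type into a return value on demand.
        return _ANTI_TO_BASE.get(self.how, self.how)

    @property
    def is_anti(self):
        return self.how in _ANTI_TO_BASE


op = MergeOpSketch("anti_full")
base_how = op.join_how()
```

After the call, `op.how` is still `"anti_full"`, so later code can tell an anti join apart from a plain outer join.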

WillAyd · 2022-10-27T15:09:52Z

pandas/core/reshape/merge.py

+                self.left_on is not None and self.right_on is not None
+            ):
+                # Merge using `on` or `left_on` and `right_on`
+                _left = [


This looks a bit more complicated than what you had in the other PR - was that not working before?

Can you address this comment?

Hey, sorry. I thought the comment above covered this.

Yes, the previous PR was working. But the validation and the actual operation were not separate.

WillAyd · 2022-11-15T05:36:15Z

pandas/core/reshape/merge.py

+            self.how = how_dict[self.how]
+            if self.left_index and self.right_index:
+                # Merge using `right_index` and `left_index`
+                left_ax = self.left.index[~self.left.index.isin(self.right.index)]


IIUC this is the same operation as np.setdiff1d. Generally, can you look at using the NumPy functions here?

debnathshoham · 2022-11-15T06:03:28Z

I had done something similar previously.
But for dtypes native to pandas, those functions either don't work or require special handling.
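A small illustration of the dtype concern, using a categorical key as an assumed example (the PR's tests also cover other extension dtypes): the pandas mask keeps the categorical dtype, while `np.setdiff1d` first converts the data to a plain NumPy array, so the dtype would have to be restored by hand.

```python
import numpy as np
import pandas as pd

left = pd.Series(pd.Categorical(["a", "b", "c"]))
right_keys = ["b"]

# pandas mask: the result is still a categorical Series.
kept = left[~left.isin(right_keys)]

# NumPy route: the categorical is converted to a plain ndarray first,
# so the categories are lost in the result.
as_numpy = np.setdiff1d(left, right_keys)
```

Both results contain `"a"` and `"c"`, but only `kept` retains the categorical dtype and the original row labels.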

github-actions · 2023-01-19T04:04:24Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

simonjayhawkins · 2023-02-22T13:41:49Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

debnathshoham and others added 30 commits August 15, 2021 23:38

ENH:included anti join functionality

fba0bb5

included multicol join

53fd41d

Merge branch 'master' into 42916

627c31e

handling index and col

448373b

added test on nan

4dd802d

Merge branch 'master' into 42916

3b17c59

removed tests cases with warning

6e3d1a4

seperated antijoin tests to another file

6427f09

replaced np funcs with pd

86ddac9

added test with pd.NA

84294e4

suggested changes

43ae0a1

added tests covering Categorcal, EA, datetime,datetime w tz, EA+multicol

c36705c

Update pandas/tests/reshape/merge/test_merge_anti.py

951406a

simplified test_basic_anti_index (thx to @attck68) Co-authored-by: attack68 <[email protected]>

formatted with black

db80abf

changed a few test setup

d93c0ac

Merge branch 'master' into 42916

90af576

Merge branch 'master' into 42916

80ce02e

removed object cast for EA dtypes; xref pandas-dev#43152

79bbbb9

Merge branch 'master' into 42916

ee7cc16

Merge branch 'master' into 42916

76cd5c6

Merge branch 'master' into 42916

d358efc

more comments

14d0d4c

added in merge.rst

3fe64f4

removed comments from example; failing doctests

fc50027

reversed mm dd order in test_anti_datetime_tz to prevent UserWarning

aba9a30

Update pandas/core/reshape/merge.py

411bcaa

Co-authored-by: JHM Darbyshire <[email protected]>

Delete out.csv

417ea13

Revert "Update pandas/core/reshape/merge.py"

09426c6

This reverts commit 411bcaa.

removed files added by mistake

b6e72aa

more comments on tests

8338a48

debnathshoham added 3 commits October 26, 2022 13:50

added whatsnew

e021a0c

Merge branch 'main' into 42916v2

50ffb40

Merge branch 'main' of https://github.com/pandas-dev/pandas into 42916v2

90639aa

debnathshoham added the Enhancement and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Oct 26, 2022

Merge branch 'main' into 42916v2

9393ed6

WillAyd requested changes Oct 26, 2022

debnathshoham and others added 5 commits October 27, 2022 11:10

updated version

378c858

Co-authored-by: William Ayd <[email protected]>

remove unnecessary typing

645b02d

Co-authored-by: William Ayd <[email protected]>

Merge branch '42916v2' of https://github.com/debnathshoham/pandas int…

139881d

…o 42916v2

added asv benchmark for anti_joins

e2a9423

Merge branch 'main' of https://github.com/pandas-dev/pandas into 42916v2

27f87ab

debnathshoham requested a review from WillAyd October 27, 2022 07:02

WillAyd requested changes Oct 27, 2022

Merge branch 'main' into 42916v2

02158be

debnathshoham requested review from WillAyd, mroeschke and phofl October 29, 2022 16:58

debnathshoham added 4 commits October 31, 2022 15:14

Merge branch 'main' into 42916v2

f79e600

Merge branch 'main' into 42916v2

d64d53a

Merge branch 'main' into 42916v2

1d50ba3

Merge branch 'main' into 42916v2

858fae5

WillAyd requested changes Nov 15, 2022

Merge branch 'main' into 42916v2

ebad0d9

github-actions bot added the Stale label Jan 19, 2023

simonjayhawkins closed this Feb 22, 2023
