ENH: anti joins #49328
Changes from 56 commits
@@ -1014,6 +1014,43 @@ def _get_join_info(
        left_ax = self.left.axes[self.axis]
        right_ax = self.right.axes[self.axis]

        if self.how in ["anti_left", "anti_right", "anti_full"]:
            how_dict = {
                "anti_left": "left",
                "anti_right": "right",
                "anti_full": "outer",
            }
            self.how = how_dict[self.how]

Linking back to the other PR you had there is concern about side effects of running this code - can you do this without assignment back to self?

In the previous PR, I had changed The final result is the same as before. I think I tweaked a couple of test results, but I guess those were bugs that were present earlier.

I think I missed, so the side effect should not be in the validation. This is in the operation.

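One way to read the request above is to resolve the underlying join type into a local value instead of overwriting `self.how`. The sketch below is only an illustration: `ToyMerge`, `ANTI_TO_BASE`, and `base_how` are hypothetical names and are not part of pandas or of this PR.

```python
# Mapping copied from the diff; the names around it are hypothetical.
ANTI_TO_BASE = {"anti_left": "left", "anti_right": "right", "anti_full": "outer"}


class ToyMerge:
    """Hypothetical stand-in for the merge operation object; not pandas code."""

    def __init__(self, how: str) -> None:
        self.how = how

    def base_how(self) -> str:
        # Look up the underlying join type without assigning back to self.how.
        return ANTI_TO_BASE.get(self.how, self.how)


op = ToyMerge("anti_left")
join_how = op.base_how()  # local value to pass on to the join call
```

With this shape the object still reports the `how` the caller asked for after the join has run.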

            if self.left_index and self.right_index:
                # Merge using `right_index` and `left_index`
                left_ax = self.left.index[~self.left.index.isin(self.right.index)]

Curious if you've tried something closer to a left join with a filter on subsequent NA values compared to this. At first glance this seems like it could be a pretty expensive way of calculating this

honestly I haven't tried that, I just went with #42916 (comment) from @attack68

For reference, below is the asv comparison

[ 50.00%] · For pandas commit 139881dd <42916v2> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.32-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· join_merge.I8Merge.time_i8merge ok
[ 75.00%] ··· ============ ============
                  how
              ------------ ------------
                 inner      1.13±0.03s
                 outer      1.12±0.02s
                  left      1.19±0.01s
                 right      1.24±0.05s
               anti_left     567±20ms
               anti_right    572±20ms
               anti_full     577±30ms
              ============ ============

IIUC this is the same operation as np.setdiff1d. Generally can you look at using the numpy functions here?

are you suggesting this

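For comparison, the two index-based formulations mentioned in this thread can be tried on a toy example. This is a sketch with made-up frames `left` and `right`, not the code under review. One caveat worth keeping in mind: `np.setdiff1d` returns sorted unique values, so it lines up with the `isin` mask only when the index has no duplicates, and the ordering may differ for an unsorted index.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
left = pd.DataFrame({"a": [1, 2, 3, 4]}, index=[10, 20, 30, 40])
right = pd.DataFrame({"b": [5, 6]}, index=[20, 40])

# Approach used in the diff: boolean mask built with Index.isin.
mask_keys = left.index[~left.index.isin(right.index)]

# Reviewer's suggestion: set difference on the underlying arrays.
setdiff_keys = np.setdiff1d(left.index.to_numpy(), right.index.to_numpy())

# Rows of `left` whose index label does not appear in `right`.
anti_left = left.loc[mask_keys]
```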

                right_ax = self.right.index[~self.right.index.isin(self.left.index)]
                self.left = self.left.loc[left_ax]
                self.right = self.right.loc[right_ax]
            elif self.on is not None or (
                self.left_on is not None and self.right_on is not None
            ):
                # Merge using `on` or `left_on` and `right_on`
                _left = [

This looks a bit more complicated than what you had in the other PR - was that not working before?

Can you address this comment?

Hey, sorry. I thought the comment above covered this. Yes, the previous PR was working. But the validation and the actual operation were not separate.

                    ~Index(self.left_join_keys[x]).isin(Index(self.right_join_keys[x]))
                    for x in range(len(self.left_join_keys))
                ]
                _right = [
                    ~Index(self.right_join_keys[x]).isin(Index(self.left_join_keys[x]))
                    for x in range(len(self.left_join_keys))
                ]
                self.left = self.left[np.sum(np.stack(_left, axis=0), axis=0) > 0]
                self.right = self.right[np.sum(np.stack(_right, axis=0), axis=0) > 0]

                self.left_join_keys = [
                    x[np.sum(np.stack(_left, axis=0), axis=0) > 0]
                    for x in self.left_join_keys
                ]
                self.right_join_keys = [
                    x[np.sum(np.stack(_right, axis=0), axis=0) > 0]
                    for x in self.right_join_keys
                ]

        if self.left_index and self.right_index and self.how != "asof":
            join_index, left_indexer, right_indexer = left_ax.join(
                right_ax, how=self.how, return_indexers=True, sort=self.sort

Is it common to still pull the columns from the right table in an anti join? These columns will just always be NA, right?

Thanks @WillAyd for the review.
That is true, but I am not sure. I had taken the example from a comment in the bug report.
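
The question about the right-hand columns can be checked with the existing `merge` API, using the "left join plus filter" formulation raised earlier in the review. This is a sketch on made-up frames and does not use the `anti_left` option proposed in this PR. It relies only on `how="left"` and `indicator=True`, which add a `_merge` column marking where each row came from.

```python
import pandas as pd

# Made-up data for illustration only.
left = pd.DataFrame({"key": [1, 2, 3], "lval": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 4], "rval": ["x", "y"]})

# Left join with an indicator column, then keep rows found only in `left`.
merged = left.merge(right, on="key", how="left", indicator=True)
anti_left = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# The right-hand column holds only missing values in the anti-join result.
right_cols_all_na = bool(anti_left["rval"].isna().all())

# Dropping the right-hand columns leaves just the left table's columns.
anti_left_only = anti_left[left.columns]
```

On this toy input the right-hand column is entirely NA, which is consistent with the reviewer's point; whether to keep such columns in the result is a design choice for the PR.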