-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: anti joins #49328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: anti joins #49328
Conversation
simplified test_basic_anti_index (thx to @attck68) Co-authored-by: attack68 <[email protected]>
Co-authored-by: JHM Darbyshire <[email protected]>
This reverts commit 411bcaa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a stab at this. Its a pretty tricky change so will likely take a few rounds of review.
Could you also update the benchmarks in asv_bench/benchmarks/test_merge.py
to parametrize these new options? Would be interesting to see how they compare
2 4 9 | ||
>>> df1.merge(df2, on="C", how="anti_left") | ||
A C B | ||
0 1 5 NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it common to still pull the columns from the right table in an anti join? These columns will just always be NA right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @WillAyd for the review.
That is true, but I am not sure. I had taken the example from a comment in the bug report.
self.how = how_dict[self.how] | ||
if self.left_index and self.right_index: | ||
# Merge using `right_index` and `left_index` | ||
left_ax = self.left.index[~self.left.index.isin(self.right.index)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious if you've tried something closer to a left join with a filter on subsequent NA values compared to this. At first glance this seems like it could be a pretty expensive way of calculating this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly I haven't tried that, I just went with #42916 (comment) from @attack68
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, below is the asv comparison
[ 50.00%] · For pandas commit 139881dd <42916v2> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.32-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· join_merge.I8Merge.time_i8merge ok
[ 75.00%] ··· ============ ============
how
------------ ------------
inner 1.13±0.03s
outer 1.12±0.02s
left 1.19±0.01s
right 1.24±0.05s
anti_left 567±20ms
anti_right 572±20ms
anti_full 577±30ms
============ ============
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is the same operation as np.setdiff1d . Generally can you look at using the numpy functions here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you suggesting this numpy
thing specifically here in _get_join_info
? I will have to try how that works out with something like EA / Categorical dtypes
from pandas.core.reshape.merge import merge | ||
|
||
|
||
class Test_AntiJoin: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can all be done in the existing test_merge.py
file for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a new file, because there were separate files for asof
and cross
.
Let me know if you feel strongly about this, will paste this into test_merge.py
Co-authored-by: William Ayd <[email protected]>
Co-authored-by: William Ayd <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks for the info. FYI in the future would just be easier for review if you to push updates back to the original PR so we can maintain the comment history in one place
"anti_right": "right", | ||
"anti_full": "outer", | ||
} | ||
self.how = how_dict[self.how] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linking back to the other PR you had there is concern about side effects of running this code - can you do this without assignment back to self?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous PR, I had changed _MergeOperation
so that anti's are converted into left/right/outer and relevant changes are made in the initial dataframes.
But Jeff had suggested #43056 (review) that we should keep the validation and the merge operation separate (as it is currently for the other joins).
The final result is same as before. I think I tweaked a couple of test results, but I guess those were bugs that were present earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I missed, so the side effect should not be in the validation.. this is in the operation
self.left_on is not None and self.right_on is not None | ||
): | ||
# Merge using `on` or `left_on` and `right_on` | ||
_left = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a bit more complicated than what you had in the other PR - was that not working before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you address this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, sorry. I thought the comment above covered this.
Yes, the previous PR was working. But the validation and the actual operation was not separate.
self.how = how_dict[self.how] | ||
if self.left_index and self.right_index: | ||
# Merge using `right_index` and `left_index` | ||
left_ax = self.left.index[~self.left.index.isin(self.right.index)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is the same operation as np.setdiff1d . Generally can you look at using the numpy functions here?
I had done something similar previously. |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.