Skip to content

ENH: anti joins #49328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 57 commits into from
Closed

ENH: anti joins #49328

wants to merge 57 commits into from

Conversation

debnathshoham
Copy link
Member

@debnathshoham debnathshoham commented Oct 26, 2022

@debnathshoham debnathshoham added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 26, 2022
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a stab at this. Its a pretty tricky change so will likely take a few rounds of review.

Could you also update the benchmarks in asv_bench/benchmarks/test_merge.py to parametrize these new options? Would be interesting to see how they compare

2 4 9
>>> df1.merge(df2, on="C", how="anti_left")
A C B
0 1 5 NaN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it common to still pull the columns from the right table in an anti join? These columns will just always be NA right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @WillAyd for the review.
That is true, but I am not sure. I had taken the example from a comment in the bug report.

self.how = how_dict[self.how]
if self.left_index and self.right_index:
# Merge using `right_index` and `left_index`
left_ax = self.left.index[~self.left.index.isin(self.right.index)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if you've tried something closer to a left join with a filter on subsequent NA values compared to this. At first glance this seems like it could be a pretty expensive way of calculating this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honestly I haven't tried that, I just went with #42916 (comment) from @attack68

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, below is the asv comparison

[ 50.00%] · For pandas commit 139881dd <42916v2> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.32-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· join_merge.I8Merge.time_i8merge                                                                                                                           ok
[ 75.00%] ··· ============ ============
                  how                  
              ------------ ------------
                 inner      1.13±0.03s 
                 outer      1.12±0.02s 
                  left      1.19±0.01s 
                 right      1.24±0.05s 
               anti_left     567±20ms  
               anti_right    572±20ms  
               anti_full     577±30ms  
              ============ ============

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is the same operation as np.setdiff1d . Generally can you look at using the numpy functions here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you suggesting this numpy thing specifically here in _get_join_info ? I will have to try how that works out with something like EA / Categorical dtypes

from pandas.core.reshape.merge import merge


class Test_AntiJoin:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can all be done in the existing test_merge.py file for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a new file, because there were separate files for asof and cross.
Let me know if you feel strongly about this, will paste this into test_merge.py

@debnathshoham
Copy link
Member Author

Thanks @WillAyd for the review.
I think I have addressed most of the changes, let me know how this look.

xrefing a PR for the same bug I did sometime back #43056. This PR is branched from the old one.

@debnathshoham debnathshoham requested a review from WillAyd October 27, 2022 07:02
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks for the info. FYI in the future would just be easier for review if you to push updates back to the original PR so we can maintain the comment history in one place

"anti_right": "right",
"anti_full": "outer",
}
self.how = how_dict[self.how]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking back to the other PR you had there is concern about side effects of running this code - can you do this without assignment back to self?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous PR, I had changed _MergeOperation so that anti's are converted into left/right/outer and relevant changes are made in the initial dataframes.
But Jeff had suggested #43056 (review) that we should keep the validation and the merge operation separate (as it is currently for the other joins).

The final result is same as before. I think I tweaked a couple of test results, but I guess those were bugs that were present earlier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I missed, so the side effect should not be in the validation.. this is in the operation

self.left_on is not None and self.right_on is not None
):
# Merge using `on` or `left_on` and `right_on`
_left = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit more complicated than what you had in the other PR - was that not working before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you address this comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry. I thought the comment above covered this.

Yes, the previous PR was working. But the validation and the actual operation was not separate.

self.how = how_dict[self.how]
if self.left_index and self.right_index:
# Merge using `right_index` and `left_index`
left_ax = self.left.index[~self.left.index.isin(self.right.index)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is the same operation as np.setdiff1d . Generally can you look at using the numpy functions here?

@debnathshoham
Copy link
Member Author

I had done something similar previously.
But for dtypes native to pandas those doesn't work / requires special handling.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 19, 2023
@simonjayhawkins
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add anti-joins to pandas.merge
3 participants