-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH:included anti join functionality #43056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I haven't reviewed fully but my instinct is telling me that you have implemented this in a very complicated way, and I think a search/consideration for a much simpler approach is worthwhile. (see my comment in the underlying issue). Additionally, your solution uses |
Hi @attack68 , this #42916 (comment) was really helpful. I have made some changes, let me know how this looks. I believe this should be able to handle pandas dtypes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, some comments. needs some tests with other dtypes (datetime, datetime w/timezone, categorical and other EAs), doesnt' have to be on all parameterization a separate test is ok.
update docs where needed (doc-string & in merge.rst)
@@ -744,6 +747,57 @@ def get_result(self) -> DataFrame: | |||
|
|||
return result.__finalize__(self, method="merge") | |||
|
|||
def _anti_join_update(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok with this being a method on _MergeOperation but it should have NO side effects, e.g. simply return left, right and assign at this level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if I understand what you mean by, being a method on _MergeOperation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to execute in the get_result()
portion, NOT during validation. validation shouldn't dot he actual computation.
IOW during _get_join_info could dispatch based on how
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the actual calculation is still happening in get_result()
.
I have just changed the configuration of left
, right
and how
to utilise the already existing joining methods. E.g. For anti_left
I have changed the left
, right
and how
to left
such that left join
would give the result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @jreback - wondering if you got a chance to look at this? and if this implementation is fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no the implementation is quite convoluted
i am not sure how to fix what u have here
simplified test_basic_anti_index (thx to @attck68) Co-authored-by: attack68 <[email protected]>
Hello @debnathshoham! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-01-14 18:16:37 UTC |
hi guys, just checking if anyone got a chance to look into this.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual mechanics of this and the coding structure look good to me - I think it has been through a few iterations and is in a clean state now.
Since this is fairly complex logic, i.e. you are taking the anti-join input and manipulating it into a structure that can then process the join as usual having been pre-processed, I would, personally, add better comments and documentation, explaining the functions and different cases.
Additionally, the tests, even though they look a little unwieldy, are thorough, but there are no comments here either and I think they would be quite helpful just indicating the purpose of the test.
I would also add a sentence describing the examples you gave in the documentation.
thanks @attack68 for the detailed review. |
I have added a couple of comments in the tests too. In the rest, I think the name of the tests are self-descriptive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im pretty happy with this as is, but I do think it can be streamlined.
), | ||
], | ||
) | ||
def test_anti_with_nan(self, expected, how, on, left_on, right_on): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this test doing that the other tests above do not capture?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was for mixed dtypes.
(added couple of more comments for these)
), | ||
], | ||
) | ||
def test_anti_with_nan_and_NA(self, expected, how, left_on, right_on): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this test purely testing pd.NA
?
In this case it seems overkill. Logically speaking if the tests above imply everything is working for nan
, and a simple test shows pd.Na
works equivalently to nan
then all tests above should work for pd.NA
so we don't have to repeat everything again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually this was to check np.nan
is not matched with pd.NA
.
(added comment)
elif self.on is not None or ( | ||
None not in self.left_on and None not in self.right_on | ||
): | ||
# Merge using `on` or `left_on` and `right_on` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be honest these comments are obvious, as you said. it was more the general concept that you could have expanded in the main docstring, why you have implemented it this way and what is the advantage. Just in case you come back to look at it after two years, or another developer comes in and tries to figure out what you are doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comments in merge.py looks sufficient to me. But again, I wrote it and a second pair of eyes can give a fresh perspective.
Do you think the logic in merge.py is too difficult to follow? (if you could point to any specific portion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to follow the existing workflow. as-is its very hard to follow what you are trying to do. the pattern is validate the merge, then actually do the operation. you have mixed the two.
@@ -744,6 +747,57 @@ def get_result(self) -> DataFrame: | |||
|
|||
return result.__finalize__(self, method="merge") | |||
|
|||
def _anti_join_update(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to execute in the get_result()
portion, NOT during validation. validation shouldn't dot he actual computation.
IOW during _get_join_info could dispatch based on how
.
Co-authored-by: JHM Darbyshire <[email protected]>
This reverts commit 411bcaa.
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
conceptually ok with adding this. but the implementation needs tight integration with the existing. you are bolting it on in an unmaintainable manner. if you want to continue, merge master and update. |
Thanks for the PR, but it appears to have gone stale. Feel free to reopen if you have time to continue with addressing the review (splitting the validate and merge operations). |
Extremely naive implementation. Happy to take suggestions.