Skip to content

ENH:included anti join functionality #43056

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 35 commits into from

Conversation

debnathshoham
Copy link
Member

Extremely naive implementation. Happy to take suggestions.

@simonjayhawkins simonjayhawkins added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 16, 2021
@debnathshoham debnathshoham marked this pull request as draft August 16, 2021 18:56
@debnathshoham debnathshoham marked this pull request as ready for review August 17, 2021 14:38
@attack68
Copy link
Contributor

I haven't reviewed fully but my instinct is telling me that you have implemented this in a very complicated way, and I think a search/consideration for a much simpler approach is worthwhile. (see my comment in the underlying issue).

Additionally, your solution uses numpy and numpy array functions. Does that mean this functionality might fail with dtypes that are native to pandas?

@debnathshoham
Copy link
Member Author

Hi @attack68 , this #42916 (comment) was really helpful. I have made some changes, let me know how this looks. I believe this should be able to handle pandas dtypes.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, some comments. needs some tests with other dtypes (datetime, datetime w/timezone, categorical and other EAs), doesnt' have to be on all parameterization a separate test is ok.

update docs where needed (doc-string & in merge.rst)

@@ -744,6 +747,57 @@ def get_result(self) -> DataFrame:

return result.__finalize__(self, method="merge")

def _anti_join_update(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with this being a method on _MergeOperation but it should have NO side effects, e.g. simply return left, right and assign at this level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I understand what you mean by, being a method on _MergeOperation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to execute in the get_result() portion, NOT during validation. validation shouldn't dot he actual computation.

IOW during _get_join_info could dispatch based on how.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the actual calculation is still happening in get_result().
I have just changed the configuration of left, right and how to utilise the already existing joining methods. E.g. For anti_left I have changed the left, right and how to left such that left join would give the result

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @jreback - wondering if you got a chance to look at this? and if this implementation is fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no the implementation is quite convoluted

i am not sure how to fix what u have here

@debnathshoham
Copy link
Member Author

This is rendering little odd. I had to split it into two lines due to pep8 length limitation, but not sure how to ensure the render is in a single line.

how : {'left', 'right', 'outer', 'inner', 'cross',
  'anti_left', 'anti_right', 'anti_full'}, default 'inner'

Screenshot 2021-08-19 at 7 44 18 PM

simplified test_basic_anti_index (thx to @attck68)

Co-authored-by: attack68 <[email protected]>
@pep8speaks
Copy link

pep8speaks commented Aug 20, 2021

Hello @debnathshoham! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-01-14 18:16:37 UTC

@debnathshoham
Copy link
Member Author

hi guys, just checking if anyone got a chance to look into this..

Copy link
Contributor

@attack68 attack68 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual mechanics of this and the coding structure look good to me - I think it has been through a few iterations and is in a clean state now.

Since this is fairly complex logic, i.e. you are taking the anti-join input and manipulating it into a structure that can then process the join as usual having been pre-processed, I would, personally, add better comments and documentation, explaining the functions and different cases.

Additionally, the tests, even though they look a little unwieldy, are thorough, but there are no comments here either and I think they would be quite helpful just indicating the purpose of the test.

I would also add a sentence describing the examples you gave in the documentation.

@debnathshoham
Copy link
Member Author

thanks @attack68 for the detailed review.
I have added a couple of more comments, would be happy to add more, if you can point somewhere you think is still unclear).

@debnathshoham
Copy link
Member Author

I have added a couple of comments in the tests too. In the rest, I think the name of the tests are self-descriptive.

Copy link
Contributor

@attack68 attack68 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im pretty happy with this as is, but I do think it can be streamlined.

),
],
)
def test_anti_with_nan(self, expected, how, on, left_on, right_on):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this test doing that the other tests above do not capture?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was for mixed dtypes.
(added couple of more comments for these)

),
],
)
def test_anti_with_nan_and_NA(self, expected, how, left_on, right_on):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test purely testing pd.NA?

In this case it seems overkill. Logically speaking if the tests above imply everything is working for nan, and a simple test shows pd.Na works equivalently to nan then all tests above should work for pd.NA so we don't have to repeat everything again.

Copy link
Member Author

@debnathshoham debnathshoham Sep 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this was to check np.nan is not matched with pd.NA.
(added comment)

elif self.on is not None or (
None not in self.left_on and None not in self.right_on
):
# Merge using `on` or `left_on` and `right_on`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be honest these comments are obvious, as you said. it was more the general concept that you could have expanded in the main docstring, why you have implemented it this way and what is the advantage. Just in case you come back to look at it after two years, or another developer comes in and tries to figure out what you are doing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comments in merge.py looks sufficient to me. But again, I wrote it and a second pair of eyes can give a fresh perspective.
Do you think the logic in merge.py is too difficult to follow? (if you could point to any specific portion)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to follow the existing workflow. as-is its very hard to follow what you are trying to do. the pattern is validate the merge, then actually do the operation. you have mixed the two.

@@ -744,6 +747,57 @@ def get_result(self) -> DataFrame:

return result.__finalize__(self, method="merge")

def _anti_join_update(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to execute in the get_result() portion, NOT during validation. validation shouldn't dot he actual computation.

IOW during _get_join_info could dispatch based on how.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 7, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 7, 2021
@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

conceptually ok with adding this. but the implementation needs tight integration with the existing. you are bolting it on in an unmaintainable manner. if you want to continue, merge master and update.

@mroeschke
Copy link
Member

Thanks for the PR, but it appears to have gone stale. Feel free to reopen if you have time to continue with addressing the review (splitting the validate and merge operations).

@mroeschke mroeschke closed this Feb 13, 2022
@debnathshoham debnathshoham mentioned this pull request Oct 27, 2022
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add anti-joins to pandas.merge
7 participants