
ENH:Create Set Operations #42177

Open
tlplayer opened this issue Jun 21, 2021 · 7 comments
Labels: Enhancement; Needs Discussion (requires discussion from core team before further action); setops (union, intersection, difference, symmetric_difference)

Comments

@tlplayer

Is your feature request related to a problem?

I wish to take the AND, OR, XOR, and NOT of dataframes. Of course I could do this manually, but a built-in way would be far cleaner and more elegant.

Possible Solution

Example 1:

A = {'a': 1 2 3}
B = {'a': 1 3 4}

pandas.xor(A,B, on = "a")

{'a': 2 4}

Example 2:

A.or( B, on= "a")

{'a': 1 2 3 4}

Example 3:

A.and(B,on='a')

{'a': 1 3}

API breaking implications

It will not affect the existing API; it is just a convenience feature.

Additional context

Typically, when comparing data from two sources, the fields will not match up and need to be cleaned through basic AND, OR, NOR, NOT, and XOR operations. This feature would greatly speed up those tasks.

Code to come later; I want to hear your thoughts on implementation first.
@tlplayer added the Enhancement and Needs Triage labels Jun 21, 2021
@jbrockmendel added the setops label Jun 21, 2021
@AlexKirko
Member

AlexKirko commented Jun 22, 2021

I believe this is mostly a duplicate of #4480, which is the same thing but for Series, and there was plenty of discussion and attempts to contribute something that would be better than just accessing Series.values and then using numpy-level set operations. Considering that there was nothing implemented for that issue in the end, I do not believe we'll get anywhere here, since performing set operations on a DataFrame is more arbitrary than on a Series (multi-level indices and such come to mind).
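For reference, the NumPy-level workaround mentioned above could look roughly like this. This is only a sketch on the toy values from the issue; the variable names are illustrative, and `Series.to_numpy()` is used in place of `.values`. Note that these functions return sorted, de-duplicated arrays and discard the index.

```python
import numpy as np
import pandas as pd

# Toy data taken from the example in the issue description.
a = pd.Series([1, 2, 3])
b = pd.Series([1, 3, 4])

# NumPy's 1-D set routines operate on the underlying arrays.
union = np.union1d(a.to_numpy(), b.to_numpy())              # OR
intersection = np.intersect1d(a.to_numpy(), b.to_numpy())   # AND
sym_diff = np.setxor1d(a.to_numpy(), b.to_numpy())          # XOR
difference = np.setdiff1d(a.to_numpy(), b.to_numpy())       # in a but not in b

print(union, intersection, sym_diff, difference)
```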

I'll leave this issue open for now, since it's not technically a duplicate and I may be missing something.

@jbrockmendel removed the Needs Triage label Jun 24, 2021
@parkdj1
Contributor

parkdj1 commented Jun 24, 2021

You can do most of these with a single line already (perhaps not as intuitive but possible)!

Using your example of
A = {'a': 1 2 3}
B = {'a': 1 3 4}

The 'OR' function to get {'a': 1 2 3 4} can be accomplished with
A.merge(B,how='outer',on='a')

The 'AND' function to get {'a': 1 3} can be accomplished with
A.merge(B,how='inner',on='a')

The 'XOR' function to get {'a': 2 4} would require a bit more manipulation, but still simple enough with a different approach imho
A.append(B).drop_duplicates(keep=False)

Not sure what taking the NOT of a dataframe would mean, but perhaps the .ne function would help? Unless you're talking about doing something like A.OR(NOT B).
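A runnable version of the one-liners above, assuming `A` and `B` are single-column DataFrames built from the issue's toy values. `pd.concat` stands in for `DataFrame.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0.

```python
import pandas as pd

# The issue's pseudo-notation {'a': 1 2 3} written as real DataFrames.
A = pd.DataFrame({"a": [1, 2, 3]})
B = pd.DataFrame({"a": [1, 3, 4]})

# OR / union: every key that appears in either frame.
union = A.merge(B, how="outer", on="a")

# AND / intersection: only keys present in both frames.
intersection = A.merge(B, how="inner", on="a")

# XOR / symmetric difference: stack the frames, then drop every row
# that occurs more than once.
sym_diff = pd.concat([A, B]).drop_duplicates(keep=False)

print(union, intersection, sym_diff, sep="\n\n")
```

One caveat with the XOR approach: `drop_duplicates(keep=False)` also removes rows that are duplicated within a single frame, so it only behaves like a symmetric difference when each input has unique rows.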

@tlplayer
Author

I was thinking more A.not(B) would be like A[~A.isin(B)]. I just think it would be more intuitive that way.

@parkdj1
Contributor

parkdj1 commented Jun 28, 2021

Hmm yeah I see what you mean. I would probably just do A[~A['a'].isin(B['a'])] which is basically what you already said.
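The two spellings are not interchangeable, which a small sketch on the issue's toy data can show. When `DataFrame.isin` is given another DataFrame, it matches cell by cell on aligned index and column labels rather than testing set membership, so only the column-based form gives the "in A but not in B" result.

```python
import pandas as pd

A = pd.DataFrame({"a": [1, 2, 3]})
B = pd.DataFrame({"a": [1, 3, 4]})

# Frame-vs-frame isin compares position by position (labels must match),
# and indexing with a boolean DataFrame masks cells instead of dropping rows.
elementwise = A[~A.isin(B)]

# Column-vs-column isin is a true membership test, so this filters rows.
difference = A[~A["a"].isin(B["a"])]

print(elementwise)
print(difference)
```

Here the value 3 survives in `elementwise` even though it occurs in `B`, because it sits at a different row position there.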

+1, it would be nice to have a more intuitive way of subsetting dataframes that doesn't require index/column names.

@AlexKirko
Member

I don't think that the pain of using less readable syntax is worth the legwork this will take to implement, but that's for the contributor to decide.

@Delengowski
Contributor

Delengowski commented Jul 31, 2021

> You can do most of these with a single line already (perhaps not as intuitive but possible)!
>
> Using your example of
> A = {'a': 1 2 3}
> B = {'a': 1 3 4}
>
> The 'OR' function to get {'a': 1 2 3 4} can be accomplished with
> A.merge(B,how='outer',on='a')
>
> The 'AND' function to get {'a': 1 3} can be accomplished with
> A.merge(B,how='inner',on='a')
>
> The 'XOR' function to get {'a': 2 4} would require a bit more manipulation, but still simple enough with a different approach imho
> A.append(B).drop_duplicates(keep=False)
>
> Not sure what taking the NOT of a dataframe would mean, but perhaps the .ne function would help? Unless you're talking about doing something like A.OR(NOT B).

'NOT' wouldn't work for the same reason it doesn't work on a set: NOT in this case would be the absolute complement, but we have no universal set U to take the set difference from.
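To make the point concrete, here is a sketch in which the universe is supplied explicitly. The frame `U` and its values are purely hypothetical; nothing in pandas provides such a universe, which is exactly why a unary NOT is ill-defined.

```python
import pandas as pd

A = pd.DataFrame({"a": [1, 2, 3]})

# Hypothetical universal set U, chosen by the caller for illustration only.
U = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# The "absolute complement" of A is really the set difference U \ A.
not_A = U[~U["a"].isin(A["a"])]

# The same holds for Python's built-in set, which has no unary complement.
not_A_as_set = set(U["a"]) - set(A["a"])

print(not_A)
print(not_A_as_set)
```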

@Delengowski
Contributor

Delengowski commented Jul 31, 2021

> I was thinking more A.not(B) would be like A[~A.isin(B)]. I just think it would be more intuitive that way.

This is set difference: the elements of A that are not shared with B, i.e. the relative complement.
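For frames with more than one column, one possible way to compute this relative complement over whole rows is a left merge with `indicator=True`. This is only a sketch: the second column `b` and its values are made up for illustration, and it assumes both frames share the same column names.

```python
import pandas as pd

# Hypothetical two-column frames; column "b" is not part of the issue's example.
A = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
B = pd.DataFrame({"a": [1, 3, 4], "b": ["x", "q", "w"]})

# Merge on all shared columns; "_merge" records where each row was found.
# B is de-duplicated first so matching rows of A are not repeated.
merged = A.merge(B.drop_duplicates(), how="left", indicator=True)

# Rows found only in A form the relative complement A \ B.
difference = merged.loc[merged["_merge"] == "left_only", ["a", "b"]]

print(difference)
```

Row `(3, "z")` is kept even though `a == 3` appears in `B`, because the comparison is over entire rows rather than a single column.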

@mroeschke added the Needs Discussion label Aug 21, 2021
6 participants