Skip to content

Remove cross-dataframe comparisons #242

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented Aug 27, 2023

Just out of interest, has any here ever needed to sum two entire dataframes? I haven't, and don't think there's much use for these comparison ops when other is a dataframe

It also avoids some of the discussion in #224 - if it can't be done, there's no need to document that in some cases it might not be allowed

@rgommers
Copy link
Member

I did a quick search only for df == on projects I had a local clone of. A large majority of the hits are comparison with scalars. I did find examples in both scikit-learn here and pandas here.

I guess the main questions I have are:

  • doesn't the same thing apply to the corresponding methods of columns?
  • if a user does need a comparison or element-wise math operation, then how do they do it if we merge this PR?

@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Aug 28, 2023

Thanks for taking a look

The pandas example is just in the release notes
And the scikit-learn one is just in a test, where they should probably be using pd.testing.assert_frame_equal
I'd be OK with adding testing functionality to the standard, performance is far less of a concern there anyway

Do we have any examples of downstream libraries using this in their core code?

if a user does need a comparison or element-wise math operation, then how do they do it if we merge this PR?

They do a join beforehand. The reason to not do a join under-the-hood for them is that like this, they only do the join once, and then can do all the comparisons maths they want

doesn't the same thing apply to the corresponding methods of columns?

If the columns come from the same dataframe, then no, it'll all work fine
If the columns come from different dataframes, then yes, the dataframes would need joining beforehand

Try expressing, using just sql syntax, the sum between col1 and col2. If they're from the same table:

select col1 + col2 as 'col1_col2_sum'
from df

If they're from different tables, then a join is required

@kkraus14
Copy link
Collaborator

I'm generally a +1 removing DataFrame.__eq__ and other binary operators against other DataFrames, but given this constraint:

If the columns come from the same dataframe, then no, it'll all work fine
If the columns come from different dataframes, then yes, the dataframes would need joining beforehand

I think we need to work through how we handle Columns more generally, because our current design doesn't have this kind of information.

@MarcoGorelli
Copy link
Contributor Author

Agree that we need to work through how we handle columns more generally, and that needs doing independently of this PR

We OK to start with this one?

@rgommers
Copy link
Member

One thing I came to realize is that there are two types of dataframe usage:

  1. The main one: SQL-like, lots of high-level operations like merge/join/select/pivot/apply,
  2. A historically common (but waning in populariy) one: "2-D arrays with axis labels and masks"

Element-wise operations like == and + are for (2) rather than (1). This is the kind of usage scikit-learn has - essentially they're treating dataframes like arrays with labels, and internally they still use arrays and then re-attach labels before returning a dataframe.

@MarcoGorelli and I discussed this a bit yesterday, and concluded that this would be okay to drop as a mode of operation (and hence, dropping == and other element-wise operations as done in this PR). That use case isn't core to today's dataframe libraries, and scikit-learn & co will not actually need this since it's still the norm (and probably will always be more efficient) to unpack the data and then use an array library for any heavy lifting.

Probably this PR can be merged after tomorrow's meeting.

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got more thumbs-up's in the meeting yesterday, and it seems like in general we'd like to revisit/remove the APIs that have now been identified as being problematic for lazy implementations. So in it goes. Thanks @MarcoGorelli and @kkraus14!

@rgommers rgommers merged commit a03dec8 into data-apis:main Sep 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants