API: flex comparisons DataFrame vs Series inconsistency #28079

Closed
jbrockmendel opened this issue Aug 21, 2019 · 4 comments · Fixed by #28601
Labels: API Design; Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments

@jbrockmendel
Member

In tests.frame.test_arithmetic we test the DataFrame part of the following, but we do not test the Series behavior (at least not in that file):

import numpy as np
import pandas as pd

arr = np.array([np.nan, 1, 6, np.nan])
arr2 = np.array([2j, np.nan, 7, None])  # the None forces object dtype
ser = pd.Series(arr)
ser2 = pd.Series(arr2)
df = pd.DataFrame(ser)
df2 = pd.DataFrame(ser2)

ser < ser2  # raises TypeError, makes sense
df < df2  # raises TypeError, makes sense

ser.lt(ser2)  # raises TypeError, makes sense

>>> df.lt(df2)
       0
0  False
1  False
2   True
3  False

The df.lt(df2) version (the "flex" op) masks positions where either df or df2 is null. That is fine, but why doesn't the Series version do the same thing?
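To make the masking described above concrete, here is a minimal sketch of a "mask, then compare" step. The helper name `masked_lt` is hypothetical and is not the pandas implementation; it only illustrates why the null-aware path never hits the complex-vs-float comparison for this data.

```python
import numpy as np
import pandas as pd


def masked_lt(left, right):
    """Hypothetical helper (not pandas internals): evaluate `<` only at
    positions where both operands are non-null; everything else is False."""
    left = np.asarray(left, dtype=object)
    right = np.asarray(right, dtype=object)
    valid = pd.notna(left) & pd.notna(right)
    result = np.zeros(left.shape, dtype=bool)
    # Compare element by element, but only where both sides are present.
    result[valid] = [a < b for a, b in zip(left[valid], right[valid])]
    return result


arr = np.array([np.nan, 1, 6, np.nan])
arr2 = np.array([2j, np.nan, 7, None])

result = masked_lt(arr, arr2)
print(result.tolist())  # [False, False, True, False]
```

The only position where both sides are non-null is index 2 (6 vs 7), so the complex value `2j` is never compared and no TypeError is raised. This matches the `df.lt(df2)` output shown above.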

jbrockmendel added the API Design and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels on Aug 21, 2019
@jbrockmendel
Member Author

@jreback @jorisvandenbossche @TomAugspurger this won't be hard to fix, but it needs input on what the desired behavior is.

@TomAugspurger
Contributor

What's the proposal here? To always mask before doing the op, so that ser.lt(ser2) doesn't raise?

@jbrockmendel
Member Author

Two options:

a) make Series.lt behave like DataFrame.lt
b) make DataFrame.lt behave like Series.lt

The main argument for a) is that it is less breaking for users relying on the DataFrame behavior.
The main argument for b) is that the behavior only matters in exactly one test, and it is a test with truly weird complex-dtype behavior (#28050).

@TomAugspurger
Contributor

OK, I'm not sure what's best here. My inclination is to follow NumPy where possible, which I think argues for option a, right? (I'm ignoring the None, which forces object dtype. Maybe I shouldn't ignore that).
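For reference, a small sketch of the NumPy behavior being alluded to, under the assumption that only float data is involved. It replaces the `2j` and `None` from the original example with plain floats and NaN, so it does not cover the object-dtype case that the parenthetical worries about.

```python
import numpy as np

# Float-only stand-ins for the arrays in the issue (an assumption: the
# complex value and the None from the original example are replaced).
left = np.array([np.nan, 1.0, 6.0, np.nan])
right = np.array([2.0, np.nan, 7.0, np.nan])

# Elementwise comparison: any position involving NaN evaluates to False
# rather than raising.
result = left < right
print(result.tolist())  # [False, False, True, False]
```

For float arrays NumPy gives the same answer as the masked DataFrame result, which is the sense in which following NumPy points toward option a).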
