-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Use align_method in comp_method_FRAME #22880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @jbrockmendel! Thanks for submitting the PR.
|
Related: #22355. |
@jorisvandenbossche this is the "pretty" alternative to #22751, needs attention despite not being green. Any thoughts on desired behavior? |
@pandas-dev/pandas-core this now contains a non-trivial API change, should make sure there is consensus before moving forward. TL;DR: broadcasting/alignment on DataFrame comparison ops behaves differently from arithmetic ops (#22355). Both behave differently from numpy convention (#22686). This PR changes the comparison behavior to match the arithmetic behavior, with the numpy part saved for the next round. Demonstration (based on tests.frame.test_operators test_boolean_comparison in master):
|
Just to make sure I understand, the changes would be
Did I miss any? |
list or tuple with len == df.shape[0] != df.shape[1]: previously broadcasts, now raises <-- AFAICT this behavior is not currently tested list or tuple with len == df.shape[1]: previously raises, now broadcasts * In a separate PR I would like to change the behavior when operating with ndarray with shape (nrows, 1) or (1, ncols) so that both of these broadcast identically to numpy (ATM the (nrows, 1) case raises for both arithmetic and comparison ops). This change would be for both arithmetic and comparison ops (whereas this PR just changes comparison ops to behave like arithmetic ops) |
Update: In a different branch I edited the alignment code so that 2d arrays of (nrows, 1) or (1, ncols) both broadcast for arithmetic ops. This did not cause any test failures. So AFAICT this particular behavior may be unintentional. update^2 #23000 has now implemented this behavior for arithmetic ops |
Codecov Report
@@ Coverage Diff @@
## master #22880 +/- ##
==========================================
+ Coverage 92.21% 92.24% +0.03%
==========================================
Files 169 169
Lines 50884 50814 -70
==========================================
- Hits 46922 46874 -48
+ Misses 3962 3940 -22
Continue to review full report at Codecov.
|
you refernced #4576 which is indeed behavior we patched in the past, but doesn't necessary make sense for a DataFrame, while for a homogenous ndarray does a bit more. Have to look at your changes in some detail.. |
Sure. The succinct version is my favorite: "Make DataFrame comparison ops behave like DataFrame arithmetic ops". That change is enabled entirely by the 1 line added in |
Extra benefit: After we get rid of Block.eval, it looks like _try_coerce_args can be simplified. |
pandas/tests/frame/test_operators.py
Outdated
|
||
result = df > b_r | ||
result = df > b_r # broadcasts like ndarray (GH#23000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed following #23000 (previously in this PR this comparison raised ValueError)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put comments on the previous line
pandas/tests/frame/test_operators.py
Outdated
|
||
result = df == b_r | ||
result = df == b_r # broadcasts like ndarray (GH#23000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed following #23000 (previously in this PR this comparison raised ValueError)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments on previous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a merge conflict with master now.
doc/source/whatsnew/v0.24.0.txt
Outdated
operations has been changed to match the arithmetic operations in these cases. | ||
(:issue:`22880`) | ||
|
||
The affected cases are: operating against a 2-dimensional ``np.ndarray`` with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slight preference for formatting these cases as a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pushed addressing your comments. This one I had to guess what you had in mind. LMK if I got it wrong.
doc/source/whatsnew/v0.24.0.txt
Outdated
In [3]: arr = np.arange(6).reshape(3, 2) | ||
In [4]: df = pd.DataFrame(arr) | ||
|
||
In [5]: df == arr[[0], :] # comparison used to broadcast where arithmetic would raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"used to" -> "previously" to avoid ambiguity with "used to".
doc/source/whatsnew/v0.24.0.txt
Outdated
... | ||
ValueError: Unable to coerce to DataFrame, shape must be (3, 2): given (1, 2) | ||
|
||
In [7]: df == (1, 2) # length matches number of columns; comparison used to raise where arithmetic would broadcast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These code blocks are put in <pre>
divs in the docs, so long lines will force horizontal scrolling / look extra bad on mobile. Better to continue the input with a
In [7]: ...
...: # comment ...
...: # maybe multiline
|
||
*Current Behavior*: | ||
|
||
.. ipython:: python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will nee d an :okexcept
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can also remove the In []
prefixes. And hopefully the block continues running when there's an exception in an okexcept
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
big issue I have here is what is the performance degredation. .eval dispatched to numexpr, I think that this is now handled in dispatch_to_* but not really sure. We may need some asvs to test the broadcasting behaviors (e.g. df + 2-d ndarray)
doc/source/whatsnew/v0.24.0.txt
Outdated
The affected cases are: | ||
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column | ||
- a list or tuple with the same length matching the number of rows in the :class:`DataFrame` | ||
- a list or tuple with thlength matching the number of columns in the :class:`DataFrame`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the length
doc/source/whatsnew/v0.24.0.txt
Outdated
df = pd.DataFrame(arr) | ||
df == arr[[0], :] # raises just like the next arithmetic operation | ||
df + arr[[0], :] | ||
df == (1, 2) # broadcasts just like the next arithmetic operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would prefer that the comments precede the lines, also use multiple ipython blocks to make the rendering more clear (if needed).
pandas/tests/frame/test_operators.py
Outdated
|
||
result = df > b_r | ||
result = df > b_r # broadcasts like ndarray (GH#23000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put comments on the previous line
pandas/tests/frame/test_operators.py
Outdated
|
||
result = df == b_r | ||
result = df == b_r # broadcasts like ndarray (GH#23000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments on previous
@jbrockmendel did you do a check of the performance to be sure there is no implication? |
There does appear to be a performance hit in the relevant case, but a) there is room for subsequent optimization and b) the bugs this avoids are much more important than the performance impact.
|
gentle ping. If this (or #23031) goes through the follow-ups will keep me busy in places orthogonal to PeriodArray constructors |
Quick general feedback: for the cases that worked before but will raise now, I think we should go through a deprecation cycle. (looking now more in detail on your explanation of the changes above) |
How do you see this performance difference? I though this would now raise an error on this PR? So if I understand correctly, some cases that now work, will raise after this PR, but you plan to let them work again in a later PR? (it's not possible to for example first align arithmetic ops with numpy, and then comparison ops with arithmeric/numpy, so you don't have this back and forth?) Something else to consider: for Index we actually deliberately disabled comparing with a length 1 array. See also the recently opened issue #23078 where we were planning to actually raise instead of broadcast |
doc/source/whatsnew/v0.24.0.txt
Outdated
The affected cases are: | ||
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column | ||
- a list or tuple with length matching the number of rows in the :class:`DataFrame` | ||
- a list or tuple with length matching the number of columns in the :class:`DataFrame`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this list is not valid rst formatted: you don't need to indentation, and need an empty line before the list
doc/source/whatsnew/v0.24.0.txt
Outdated
(:issue:`22880`) | ||
|
||
The affected cases are: | ||
- operating against a 2-dimensional ``np.ndarray`` with either 1 row or 1 column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you indicate here what changes for those cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated; I think other whatsnew comments addressed. Note that Previous Behavior blocks refer to 0.23, not pre-this-PR
If you drill down into this, it is driven by poor performance of cmp(DataFrame, Series), which I think can be improved on a subsequent pass. In the interim I see this as a uncommon corner case and the performance hit as being less important than internal consistency and bugfixes (just noticed this clsoes #20090)
That was the case when this PR was opened, but since then #22355 changed the behavior of op(DataFrame, ndarray[ndim=2]) to match numpy broadcasting behavior.
If I understand correctly, this is similar to what I have in mind for optimizing the DataFrame vs Series op.
I think the only case that falls into this category is a list/tuple (not ndarray or Series) with I would be open to subsequently changing the behavior of both arithmetic+comparisons to do something like:
but only if this is made consistent across arithmetic/comparison ops. |
But then the whatsnew addition you have in this PR is not up to date anymore? As that still mentions that this case will now raise an error |
Will update. |
Can you please give a new summary of what this PR does / changes etc, as I am rather lost now. The comment above about "non-trivial API change" is also not relevant any more? (and update the top post) |
Updated above. Bottom line: This PR makes DataFrame comparison ops broadcasting consistent with DataFrame arithmetic ops. |
That's an open issue, not a merged PR? |
Updated, #23000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. some minor comments. ping on green.
doc/source/whatsnew/v0.24.0.txt
Outdated
df = pd.DataFrame(arr) | ||
|
||
.. ipython:: python | ||
# comparison and arithmetic both broadcast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer you make the comments actual sentences here, which is pretty easy when you already have the blocks.
@@ -48,15 +48,19 @@ def test_mixed_comparison(self): | |||
assert result.all().all() | |||
|
|||
def test_df_boolean_comparison_error(self): | |||
# GH 4576 | |||
# GH#4576, GH#22880 | |||
# boolean comparisons with a tuple/list give unexpected results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change this comment
pandas/tests/frame/test_operators.py
Outdated
|
||
result = df > tup | ||
assert_frame_equal(result, expected) | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you assert the error messages in these
ping |
This reverts commit e96c691.
OK, so I simply reverted the merge. That seemed the easiest way to continue the discussion. It's a bit annoying, but @jbrockmendel you will need to open a new PR since this one cannot be re-opened .. |
…ndas-dev#23120) This reverts commit e96c691.
Closes #20090
update Since #23000 was merged, some of the discussion is out of date. The bottom line remains unchanged: This PR makes DataFrame comparison ops behave like DataFrame arithmetic ops currently do. Also fixes some bugs e.g. #20090
end update
This is a much nicer alternative to the implementation in #22751. The problem is that two tests still fail with this implementation. We need to pin down the design spec more explicitly regardless.
core.ops has three functions for defining DataFrame ops: _arith_method_FRAME, _flex_comp_method_FRAME, _comp_method_FRAME. The first two both call
_align_method_FRAME
, with_comp_method_FRAME
being the outlier. This PR just adds that alignment call.The two tests that currently fail:
I understand why the behavior tested by the first test makes sense, but don't see the logic behind having
df == (2, 2)
raise (maybe #4576 holds the answer, will look at that more closely)