-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix integral truediv and floordiv for pyarrow types with large divisor and avoid floating points for floordiv #56677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/arrays/arrow/array.py
Outdated
divided, | ||
) | ||
# Ensure compatibility with older versions of pandas where | ||
# int8 // int64 returned int8 rather than int64. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something that can be changed? If so, the cast to left.type can be removed, result.type is guaranteed to already be integral.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't think this old behavior is desirable, for example:
>>> a = pd.Series([-128], dtype="int8[pyarrow]")
>>> b = pd.Series([-1], dtype="int64[pyarrow]")
>>> a // b
With this cast, this operation fails with overflow error, because 128 can't fit in an int8. In numpy, it looks like this operation promotes to common type of int64:
>>> a = pd.Series([-128], dtype="int8")
>>> b = pd.Series([-1], dtype="int64")
>>> a // b
0 128
dtype: int64
>>>
So I think this cast should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's comments in the test about ArrowExtensionArray not promoting, so maybe this is the intended behavior? Restored the cast.
pandas/core/arrays/arrow/array.py
Outdated
divided = pc.divide(left, right) | ||
if pa.types.is_integer(divided.type): | ||
# GH 56676, avoid storing intermediate calculating in floating point type. | ||
has_remainder = pc.not_equal(pc.multiply(divided, right), left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO floordiv would be a useful compute kernel in arrow, especially since python and numpy both support it. I opened apache/arrow#39386, if added would make this code much simpler (and probably perform better).
41fd8b5
to
167507e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is reasonable. Of course the pyarrow compute kernel would be ideal, but I would be OK with this for now and we can always switch to that kernel as it becomes available.
@mroeschke any thoughts?
pandas/core/arrays/arrow/array.py
Outdated
pc.and_( | ||
has_remainder, | ||
pc.less( | ||
pc.bit_wise_xor(left, right), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a deal breaker but instead of the bitwise xor can is it easy to just xor comparisons of each argument being < 0? I would be willing to sacrifice some optimization potential here for readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah great suggestion - yea I think that is a viable compromise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I can change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a way to scope that so it only evaluates when the operands are not unsigned that would be better as well. You have some very nice tricks here, but not every Python developer may know the impacts of bitwise operations like this. So just want to make sure it doesnt get refactored into something unintended in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a way to scope that so it only evaluates when the operands are not unsigned that would be better as well.
I am not aware of this, pc.if_else from what I can tell, even if the condition is all true, or all false, evaluates both conditions, since it takes an array of values, but not an expression, so I'm not sure there's a way to avoid evaluating it for both sides. If there's not a way to do this, it may be a useful feature for arrow, as this seems like an optimization that would benefit many use cases.
Agreed, however, even in the best case if this is added in arrow 15 (next major release), given that pandas supports a minimum of pyarrow 10, it may be a while before the minimum pyarrow version supported by pandas has a floordiv kernel, so I think this code would still be useful for the near future. |
This reverts commit 47c4474.
Yeah I'm OK with this workaround until pyarrow may introduce a floordiv kernel we can replace this with |
Thanks for identifying and fixing these @rohanjain101 |
…yarrow types with large divisor and avoid floating points for floordiv
… for pyarrow types with large divisor and avoid floating points for floordiv) (#56744) Backport PR #56677: Fix integral truediv and floordiv for pyarrow types with large divisor and avoid floating points for floordiv Co-authored-by: rohanjain101 <[email protected]>
…r and avoid floating points for floordiv (pandas-dev#56677) * avoid floating points for integral floor division * comment * typo * gh reference * improve test * fix comment * cleanup * cleanup * whatsnew * remove assert * revert * simplify comment * simplify logic * simplify test * fix uint64 overflow * bug fix * fix overflow * fix comment * fix truediv for large divsor * improve unsigned // unsigned * improve test cases * negative operand condition * Revert "negative operand condition" This reverts commit 47c4474. * fix readability * cleanup comments --------- Co-authored-by: Rohan Jain <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.