Skip to content

Fix integral truediv and floordiv for pyarrow types with large divisor and avoid floating points for floordiv #56677

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Jan 5, 2024

Conversation

rohanjain101
Copy link
Contributor

@rohanjain101 rohanjain101 commented Dec 29, 2023

divided,
)
# Ensure compatibility with older versions of pandas where
# int8 // int64 returned int8 rather than int64.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that can be changed? If so, the cast to left.type can be removed, result.type is guaranteed to already be integral.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't think this old behavior is desirable, for example:

>>> a = pd.Series([-128], dtype="int8[pyarrow]")
>>> b = pd.Series([-1], dtype="int64[pyarrow]")
>>> a // b

With this cast, this operation fails with overflow error, because 128 can't fit in an int8. In numpy, it looks like this operation promotes to common type of int64:

>>> a = pd.Series([-128], dtype="int8")
>>> b = pd.Series([-1], dtype="int64")
>>> a // b
0    128
dtype: int64
>>>

So I think this cast should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's comments in the test about ArrowExtensionArray not promoting, so maybe this is the intended behavior? Restored the cast.

divided = pc.divide(left, right)
if pa.types.is_integer(divided.type):
# GH 56676, avoid storing intermediate calculating in floating point type.
has_remainder = pc.not_equal(pc.multiply(divided, right), left)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO floordiv would be a useful compute kernel in arrow, especially since python and numpy both support it. I opened apache/arrow#39386, if added would make this code much simpler (and probably perform better).

@rohanjain101 rohanjain101 changed the title Avoid floating points for integral floordiv for pyarrow types Fix integral truediv and floordiv for pyarrow types and avoid floating points for floordiv Jan 3, 2024
@rohanjain101 rohanjain101 changed the title Fix integral truediv and floordiv for pyarrow types and avoid floating points for floordiv Fix integral truediv and floordiv for pyarrow types with large divisor and avoid floating points for floordiv Jan 3, 2024
@WillAyd WillAyd added the Arrow pyarrow functionality label Jan 3, 2024
@rohanjain101 rohanjain101 requested a review from WillAyd January 4, 2024 19:58
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is reasonable. Of course the pyarrow compute kernel would be ideal, but I would be OK with this for now and we can always switch to that kernel as it becomes available.

@mroeschke any thoughts?

pc.and_(
has_remainder,
pc.less(
pc.bit_wise_xor(left, right),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a deal breaker but instead of the bitwise xor can is it easy to just xor comparisons of each argument being < 0? I would be willing to sacrifice some optimization potential here for readability

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah great suggestion - yea I think that is a viable compromise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I can change this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a way to scope that so it only evaluates when the operands are not unsigned that would be better as well. You have some very nice tricks here, but not every Python developer may know the impacts of bitwise operations like this. So just want to make sure it doesnt get refactored into something unintended in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a way to scope that so it only evaluates when the operands are not unsigned that would be better as well.

I am not aware of this, pc.if_else from what I can tell, even if the condition is all true, or all false, evaluates both conditions, since it takes an array of values, but not an expression, so I'm not sure there's a way to avoid evaluating it for both sides. If there's not a way to do this, it may be a useful feature for arrow, as this seems like an optimization that would benefit many use cases.

@rohanjain101
Copy link
Contributor Author

Of course the pyarrow compute kernel would be ideal, but I would be OK with this for now and we can always switch to that kernel as it becomes available.

Agreed, however, even in the best case if this is added in arrow 15 (next major release), given that pandas supports a minimum of pyarrow 10, it may be a while before the minimum pyarrow version supported by pandas has a floordiv kernel, so I think this code would still be useful for the near future.

@mroeschke
Copy link
Member

Yeah I'm OK with this workaround until pyarrow may introduce a floordiv kernel we can replace this with

@rohanjain101 rohanjain101 requested a review from mroeschke January 5, 2024 02:30
@mroeschke mroeschke added this to the 2.2 milestone Jan 5, 2024
@mroeschke mroeschke merged commit a311f77 into pandas-dev:main Jan 5, 2024
@mroeschke
Copy link
Member

Thanks for identifying and fixing these @rohanjain101

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 5, 2024
…yarrow types with large divisor and avoid floating points for floordiv
mroeschke pushed a commit that referenced this pull request Jan 5, 2024
… for pyarrow types with large divisor and avoid floating points for floordiv) (#56744)

Backport PR #56677: Fix integral truediv and floordiv for pyarrow types with large divisor and avoid floating points for floordiv

Co-authored-by: rohanjain101 <[email protected]>
@rohanjain101 rohanjain101 deleted the floordiv_integral branch January 5, 2024 22:44
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…r and avoid floating points for floordiv (pandas-dev#56677)

* avoid floating points for integral floor division

* comment

* typo

* gh reference

* improve test

* fix comment

* cleanup

* cleanup

* whatsnew

* remove assert

* revert

* simplify comment

* simplify logic

* simplify test

* fix uint64 overflow

* bug fix

* fix overflow

* fix comment

* fix truediv for large divsor

* improve unsigned // unsigned

* improve test cases

* negative operand condition

* Revert "negative operand condition"

This reverts commit 47c4474.

* fix readability

* cleanup comments

---------

Co-authored-by: Rohan Jain <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality
Projects
None yet
3 participants