
Fix integral truediv and floordiv for pyarrow types with large divisor and avoid floating points for floordiv #56677


Merged: 25 commits, Jan 5, 2024
24 changes: 19 additions & 5 deletions pandas/core/arrays/arrow/array.py
@@ -127,12 +127,26 @@ def floordiv_compat(
left: pa.ChunkedArray | pa.Array | pa.Scalar,
right: pa.ChunkedArray | pa.Array | pa.Scalar,
) -> pa.ChunkedArray:
# Ensure int // int -> int mirroring Python/Numpy behavior
# as pc.floor(pc.divide_checked(int, int)) -> float
converted_left = cast_for_truediv(left, right)
result = pc.floor(pc.divide(converted_left, right))
if pa.types.is_integer(left.type) and pa.types.is_integer(right.type):
divided = pc.divide(left, right)
if pa.types.is_integer(divided.type):
            # GH 56676, avoid storing the intermediate result in a floating point type.
has_remainder = pc.not_equal(pc.multiply(divided, right), left)
Contributor Author:

IMO floordiv would be a useful compute kernel in arrow, especially since python and numpy both support it. I opened apache/arrow#39386; if added, it would make this code much simpler (and probably perform better).

result = pc.if_else(
pc.less(divided, 0),
# Avoid using subtract_checked which has significantly worse perf:
# https://github.com/apache/arrow/issues/37678, especially since
# integer overflow should be impossible here.
pc.if_else(has_remainder, pc.subtract(divided, 1), divided),
divided,
)
# Ensure compatibility with older versions of pandas where
# int8 // int64 returned int8 rather than int64.
Contributor Author:
Is this something that can be changed? If so, the cast to left.type can be removed; result.type is guaranteed to already be integral.

Contributor Author:
I actually don't think this old behavior is desirable, for example:

>>> a = pd.Series([-128], dtype="int8[pyarrow]")
>>> b = pd.Series([-1], dtype="int64[pyarrow]")
>>> a // b

With this cast, this operation fails with an overflow error, because 128 can't fit in an int8. In numpy, it looks like this operation promotes to the common type int64:

>>> a = pd.Series([-128], dtype="int8")
>>> b = pd.Series([-1], dtype="int64")
>>> a // b
0    128
dtype: int64

So I think this cast should be removed.
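A quick standalone check of the promotion behavior described above (plain numpy, no pyarrow required): the int8/int64 pair promotes to int64, so the quotient 128 fits even though it overflows int8.

```python
import numpy as np

a = np.array([-128], dtype=np.int8)
b = np.array([-1], dtype=np.int64)
# numpy promotes int8 // int64 to the common type int64, so -128 // -1 = 128
# fits in the result dtype instead of overflowing int8.
q = a // b
print(q, q.dtype)
```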

Contributor Author:
Looks like there are comments in the tests about ArrowExtensionArray not promoting, so maybe this is the intended behavior. Restored the cast.

result = result.cast(left.type)
else:
assert pa.types.is_floating(divided.type) or pa.types.is_decimal(
divided.type
)
result = pc.floor(divided)
return result

ARROW_ARITHMETIC_FUNCS = {
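The branch added above derives floor division from Arrow's C-style truncating integer division: truncation and floor agree except when the operands have opposite signs and the division is inexact, in which case the truncated quotient is one greater than the floor. A minimal pure-Python sketch of that rule (the helper name is illustrative, not from the PR); note it tests the operand signs, which also covers zero quotients such as -1 // 2:

```python
def floor_from_trunc(a: int, b: int) -> int:
    """Floor division rebuilt from truncating division (illustrative sketch)."""
    # Truncating quotient, matching what Arrow's integer divide returns.
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    # Truncation rounds toward zero, floor toward -inf: the two differ by
    # exactly 1 when the signs differ and a remainder exists.
    if (a < 0) != (b < 0) and q * b != a:
        q -= 1
    return q

# Agrees with Python's floor division across sign combinations.
assert all(
    floor_from_trunc(a, b) == a // b
    for a in range(-20, 21)
    for b in range(-5, 6)
    if b != 0
)
```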
8 changes: 8 additions & 0 deletions pandas/tests/extension/test_arrow.py
@@ -3246,6 +3246,14 @@ def test_arrow_floordiv_large_values():
tm.assert_series_equal(result, expected)


def test_arrow_floordiv_large_integral_result():
# GH 56676
a = pd.Series([18014398509481983], dtype="int64[pyarrow]")
expected = pd.Series([18014398509481983], dtype="int64[pyarrow]")
result = a // 1
tm.assert_series_equal(result, expected)
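The value in this test is 2**54 - 1, which cannot be represented exactly in a float64 (53-bit significand). A quick plain-Python demonstration of why routing the floordiv through floating point, as the old pc.floor(pc.divide(...)) path did, corrupts it:

```python
import math

x = 18014398509481983  # 2**54 - 1: needs 54 significand bits
assert x == 2**54 - 1

# float64 carries a 53-bit significand, so the nearest double rounds up.
assert int(float(x)) == 2**54

# Going through floating point makes floor(x / 1) off by one...
assert math.floor(x / 1) == x + 1
# ...while pure integer floor division is exact.
assert x // 1 == x
```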


def test_string_to_datetime_parsing_cast():
# GH 56266
string_dates = ["2020-01-01 04:30:00", "2020-01-02 00:00:00", "2020-01-03 00:00:00"]