Fix integral truediv and floordiv for pyarrow types with large divisors #56707

Closed
wants to merge 4 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.2.0.rst
@@ -776,6 +776,7 @@ Timezones
 Numeric
 ^^^^^^^
 - Bug in :func:`read_csv` with ``engine="pyarrow"`` causing rounding errors for large integers (:issue:`52505`)
+- Bug in :meth:`Series.__floordiv__` and :meth:`Series.__truediv__` for :class:`ArrowDtype` with integral dtypes raising for large divisors (:issue:`56706`)
 - Bug in :meth:`Series.__floordiv__` for :class:`ArrowDtype` with integral dtypes raising for large values (:issue:`56645`)
 - Bug in :meth:`Series.pow` not filling missing values correctly (:issue:`55512`)

21 changes: 14 additions & 7 deletions pandas/core/arrays/arrow/array.py
@@ -109,7 +109,7 @@

     def cast_for_truediv(
         arrow_array: pa.ChunkedArray, pa_object: pa.Array | pa.Scalar
-    ) -> pa.ChunkedArray:
+    ) -> tuple[pa.ChunkedArray, pa.Array | pa.Scalar]:
         # Ensure int / int -> float mirroring Python/Numpy behavior
         # as pc.divide_checked(int, int) -> int
         if pa.types.is_integer(arrow_array.type) and pa.types.is_integer(
@@ -120,17 +120,24 @@ def cast_for_truediv(
             # Intentionally not using arrow_array.cast because it could be a scalar
             # value in reflected case, and safe=False only added to
             # scalar cast in pyarrow 13.
-            return pc.cast(arrow_array, pa.float64(), safe=False)
-        return arrow_array
+            # In arrow, common type between integral and float64 is float64,
+            # but integral type is safe casted to float64, to mirror python
+            # and numpy, we want an unsafe cast, so we cast both operands to
+            # to float64 before invoking arrow.
+            return pc.cast(arrow_array, pa.float64(), safe=False), pc.cast(
+                pa_object, pa.float64(), safe=False
+            )
+
+        return arrow_array, pa_object

     def floordiv_compat(
         left: pa.ChunkedArray | pa.Array | pa.Scalar,
         right: pa.ChunkedArray | pa.Array | pa.Scalar,
     ) -> pa.ChunkedArray:
         # Ensure int // int -> int mirroring Python/Numpy behavior
         # as pc.floor(pc.divide_checked(int, int)) -> float
-        converted_left = cast_for_truediv(left, right)
-        result = pc.floor(pc.divide(converted_left, right))
+        converted_left, converted_right = cast_for_truediv(left, right)
@rohanjain101 (Contributor, Author), Jan 3, 2024:

#56677 also fixes this for floordiv. IMO that is the more suitable fix for floordiv, since CPython and NumPy implement floordiv without going through a double. This fix is similar to #56647; I changed floordiv in this PR only because the current floordiv implementation for integral types goes through truediv.
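
To illustrate the point about going through a double, here is a small pure-Python sketch. It is not pandas or pyarrow code, and the helper name `floordiv_via_float` is hypothetical. It compares integer floor division with a floor division routed through a 64-bit float, which is roughly what a truediv-based implementation does.

```python
import math


def floordiv_via_float(a: int, b: int) -> int:
    """Hypothetical helper: floor-divide by dividing as floats first."""
    return int(math.floor(float(a) / float(b)))


a = 2**53 + 1  # smallest positive integer a float64 cannot represent exactly
b = 1

exact = a // b                    # pure integer arithmetic, no double involved
lossy = floordiv_via_float(a, b)  # float(a) rounds to 2**53 before dividing

print(exact)  # 9007199254740993
print(lossy)  # 9007199254740992
```

The two results differ by one, which is why an integer-only floordiv is the more faithful choice for large values.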

This fix should still apply to truediv, since both NumPy and Python truediv perform an unsafe cast to a double:

>>> a = pd.Series([18014398509481983], dtype="int64")
>>> b = pd.Series([18014398509481984], dtype="int64")
>>> a / b
0    1.0
dtype: float64
>>>
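
The `1.0` above comes from rounding. As a rough pure-Python sketch, using built-in `int` and `float` as a stand-in for the int64 to float64 cast (an assumption made for illustration, not the pandas code path), the dividend `2**54 - 1` has no exact float64 representation and rounds up to `2**54`:

```python
a = 18014398509481983  # 2**54 - 1, not exactly representable as float64
b = 18014398509481984  # 2**54, exactly representable

a_as_float = float(a)  # lossy conversion: silently rounds to the nearest double

print(a_as_float == float(b))  # True
print(int(a_as_float) == a)    # False: the round trip does not recover a
print(a / b)                   # 1.0
```

A safe cast would reject `a` because the round trip changes its value; an unsafe cast accepts the rounding, which is the behavior the fix mirrors.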

+        result = pc.floor(pc.divide(converted_left, converted_right))
         if pa.types.is_integer(left.type) and pa.types.is_integer(right.type):
             result = result.cast(left.type)
         return result
@@ -142,8 +149,8 @@ def floordiv_compat(
         "rsub": lambda x, y: pc.subtract_checked(y, x),
         "mul": pc.multiply_checked,
         "rmul": lambda x, y: pc.multiply_checked(y, x),
-        "truediv": lambda x, y: pc.divide(cast_for_truediv(x, y), y),
-        "rtruediv": lambda x, y: pc.divide(y, cast_for_truediv(x, y)),
+        "truediv": lambda x, y: pc.divide(*cast_for_truediv(x, y)),
+        "rtruediv": lambda x, y: pc.divide(*cast_for_truediv(y, x)),
         "floordiv": lambda x, y: floordiv_compat(x, y),
         "rfloordiv": lambda x, y: floordiv_compat(y, x),
         "mod": NotImplemented,
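The change to the `truediv` and `rtruediv` entries relies on the helper now returning both operands as a tuple, which is then star-unpacked into the division call. A minimal pure-Python sketch of that pattern follows; `cast_pair_for_truediv`, `divide`, and `ARITHMETIC_FUNCS` are hypothetical stand-ins for illustration, not the pandas or pyarrow functions.

```python
from typing import Callable


def cast_pair_for_truediv(left: int, right: int) -> tuple[float, float]:
    """Hypothetical stand-in for cast_for_truediv: convert both operands."""
    return float(left), float(right)


def divide(left: float, right: float) -> float:
    """Hypothetical stand-in for pc.divide."""
    return left / right


# x is the array's own data and y is the other operand.
ARITHMETIC_FUNCS: dict[str, Callable[[int, int], float]] = {
    # x / y: cast in (x, y) order, then unpack into divide
    "truediv": lambda x, y: divide(*cast_pair_for_truediv(x, y)),
    # y / x: swap the operands before casting so the order carries through
    "rtruediv": lambda x, y: divide(*cast_pair_for_truediv(y, x)),
}

print(ARITHMETIC_FUNCS["truediv"](1, 4))   # 0.25
print(ARITHMETIC_FUNCS["rtruediv"](1, 4))  # 4.0
```

Swapping the arguments before the cast keeps the reflected case a one-line variation of the forward case, since the tuple already carries the operand order.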
18 changes: 18 additions & 0 deletions pandas/tests/extension/test_arrow.py
@@ -3246,6 +3246,24 @@ def test_arrow_floordiv_large_values():
     tm.assert_series_equal(result, expected)


+def test_arrow_true_division_large_divisor():
+    # GH 56706
+    a = pd.Series([0], dtype="int64[pyarrow]")
+    b = pd.Series([18014398509481983], dtype="int64[pyarrow]")
+    expected = pd.Series([0], dtype="float64[pyarrow]")
+    result = a / b
+    tm.assert_series_equal(result, expected)
+
+
+def test_arrow_floor_division_large_divisor():
+    # GH 56706
+    a = pd.Series([0], dtype="int64[pyarrow]")
+    b = pd.Series([18014398509481983], dtype="int64[pyarrow]")
+    expected = pd.Series([0], dtype="int64[pyarrow]")
+    result = a // b
+    tm.assert_series_equal(result, expected)
+
+
 def test_string_to_datetime_parsing_cast():
     # GH 56266
     string_dates = ["2020-01-01 04:30:00", "2020-01-02 00:00:00", "2020-01-03 00:00:00"]