BUG: failing tests in pyarrow with pandas nightly #52782

Closed
3 tasks done
AlenkaF opened this issue Apr 19, 2023 · 3 comments

Labels
Arrow (pyarrow functionality), Bug, Needs Info (Clarification about behavior needed to assess issue), Upstream issue (Issue related to pandas dependency)

Comments

AlenkaF commented Apr 19, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pyarrow as pa
import pandas as pd
from datetime import datetime

arr = pa.array([datetime(1, 1, 1)], pa.timestamp('s', tz='America/New_York'))
table = pa.table({'a': arr})

table.to_pandas(safe=False)
#                                     a
# 0 1754-08-30 17:47:41.128654848-04:56
table.column('a').to_pandas(safe=False)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "pyarrow/array.pxi", line 837, in pyarrow.lib._PandasConvertible.to_pandas
#     return self._to_pandas(options, categories=categories,
#   File "pyarrow/table.pxi", line 469, in pyarrow.lib.ChunkedArray._to_pandas
#     arr = pandas_dtype.__from_arrow__(self)
#   File "/Users/alenkafrim/repos/pyarrow-dev/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 848, in __from_arrow__
#     array = array.cast(pyarrow.timestamp(unit=self._unit), safe=True)
#   File "pyarrow/table.pxi", line 551, in pyarrow.lib.ChunkedArray.cast
#     return _pc().cast(self, target_type, safe=safe, options=options)
#   File "/Users/alenkafrim/repos/arrow/python/pyarrow/compute.py", line 400, in cast
#     return call_function("cast", [arr], options, memory_pool)
#   File "pyarrow/_compute.pyx", line 572, in pyarrow._compute.call_function
#     return func.call(args, options=options, memory_pool=memory_pool,
#   File "pyarrow/_compute.pyx", line 367, in pyarrow._compute.Function.call
#     result = GetResultValue(
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#     return check_status(status)
#   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
#     raise ArrowInvalid(message)
# pyarrow.lib.ArrowInvalid: Casting from timestamp[s, tz=America/New_York] to timestamp[ns] would result in out of bounds timestamp: -62135596800
# /Users/alenkafrim/repos/arrow/cpp/src/arrow/compute/exec.cc:920  kernel_->exec(kernel_ctx_, input, &output)
# /Users/alenkafrim/repos/arrow/cpp/src/arrow/compute/function.cc:276  executor->Execute(input, &listener)

arr = pa.array([1, 2, 3], pa.timestamp("ms", "Europe/Brussels"))
table = pa.table({"name": arr})
result = table.column("name").to_pandas()
result.name
# is None, but should be "name"

Issue Description

Some tests in PyArrow started failing about a day ago (apache/arrow#35235). Converting a pyarrow table to pandas works fine for timestamps with timezones, but converting a single column to a Series fails if the timezone is not None.

The tests are failing on the Arrow code freeze branch made at the beginning of this week. The same tests pass with the latest pandas release, but fail with the development (nightly) version.

We suspect this could be connected with #52677 but are not sure.

@mroeschke do you maybe have an idea of what could be triggering this?

Expected Behavior

The conversion from table column to pandas should not fail.

Installed Versions

INSTALLED VERSIONS

commit : d91b04c
python : 3.10.10.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 21:00:41 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+548.gd91b04c30e
numpy : 1.21.6
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.2.0
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.2
hypothesis : 6.70.0
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.0.dev418+g6432a2382e.d20230414
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2023.2.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None

AlenkaF added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Apr 19, 2023
mroeschke (Member) commented

array = array.cast(pyarrow.timestamp(unit=self._unit), safe=True)

From this line, looks like #52201 was the change.

I guess this is tricky because the __from_arrow__ can essentially ignore the safe=False in the to_pandas call. I don't think it would be good to end up with 1754-08-30 17:47:41.128654848-04:56 even if safe=False?
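To make the two outcomes discussed here concrete, the following is a minimal sketch in plain Python, not pyarrow's actual implementation. It assumes timestamps are stored as int64 and that an unsafe cast simply keeps the low 64 bits; `cast_seconds_to_ns` is a hypothetical helper written for illustration only. Under those assumptions, the safe path rejects the value from the error message, while the unsafe path wraps around to a value whose sub-second digits (.128654848) match the odd 1754 timestamp in the report, and whose UTC time of 22:43:41 corresponds to 17:47:41 at the -04:56 offset.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NS_PER_SECOND = 10**9


def cast_seconds_to_ns(seconds: int, safe: bool = True) -> int:
    """Toy model (hypothetical) of a timestamp[s] -> timestamp[ns] cast on int64."""
    exact = seconds * NS_PER_SECOND
    if INT64_MIN <= exact <= INT64_MAX:
        return exact
    if safe:
        raise OverflowError(f"out of bounds timestamp: {seconds}")
    # Unsafe path (assumption): keep only the low 64 bits, two's-complement style.
    return (exact - INT64_MIN) % 2**64 + INT64_MIN


seconds = -62135596800  # value reported in the ArrowInvalid message

try:
    cast_seconds_to_ns(seconds, safe=True)
    safe_cast_failed = False
except OverflowError:
    safe_cast_failed = True

wrapped_ns = cast_seconds_to_ns(seconds, safe=False)
# Interpret the wrapped value as an offset from the Unix epoch (UTC),
# truncated to microseconds because datetime has no nanosecond field.
wrapped_utc = datetime(1970, 1, 1) + timedelta(microseconds=wrapped_ns // 1000)
print(safe_cast_failed, wrapped_ns, wrapped_utc)
```

The arithmetic suggests the 1754 result is an overflow artifact rather than a meaningful date, which is consistent with the concern that it is not a desirable outcome even with safe=False.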

mroeschke (Member) commented

is None, but should be "name"

Guessing this also is due to going through __from_arrow__ now, but since __from_arrow__ does not have access to the column name I'm guessing this is a pyarrow issue?
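The point about the column name can be illustrated with a toy model. This is a hedged sketch, not pandas or pyarrow code: `ToyDtype`, `ToySeries`, and `column_to_series` are hypothetical names. It only assumes what the comment above states, namely that the `__from_arrow__` hook receives the array values and nothing else, so whichever caller wraps the result in a Series is the only place the name can be attached.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToySeries:
    """Hypothetical stand-in for a pandas Series: values plus an optional name."""
    values: List[int]
    name: Optional[str] = None


class ToyDtype:
    """Hypothetical stand-in for an extension dtype with a __from_arrow__ hook."""

    def __from_arrow__(self, values: List[int]) -> List[int]:
        # Only the raw values arrive here; the column name is not available.
        return list(values)


def column_to_series(values: List[int], dtype: ToyDtype,
                     name: Optional[str] = None) -> ToySeries:
    """Toy caller: converts through the dtype hook, then attaches the name itself."""
    converted = dtype.__from_arrow__(values)
    return ToySeries(converted, name=name)


# If the caller forgets to forward the name, the result is unnamed.
unnamed = column_to_series([1, 2, 3], ToyDtype())
# Forwarding the name in the caller restores the expected behavior.
named = column_to_series([1, 2, 3], ToyDtype(), name="name")
print(unnamed.name, named.name)
```

In this model the fix belongs in the caller, which matches the guess that the missing name needs to be handled on the pyarrow side.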

mroeschke added the Needs Info (Clarification about behavior needed to assess issue), Upstream issue (Issue related to pandas dependency), and Arrow (pyarrow functionality) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Apr 19, 2023

AlenkaF commented Apr 20, 2023

From this line, looks like #52201 was the change.

Thank you for this info, it is correct! 👍

I guess this is tricky because the __from_arrow__ can essentially ignore the safe=False in the to_pandas call. I don't think it would be good to end up with 1754-08-30 17:47:41.128654848-04:56 even if safe=False?

After talking to Joris, we think this is actually connected to the fact that pyarrow still casts the datetime to ns if the resolution differs (apache/arrow#34462). As a result, pandas receives the datetime with ns resolution and performs a cast that is no longer needed (https://github.com/tswast/pandas/blob/8c509257eaee411fbbe9a7e696bfad3d01bb1e1f/pandas/core/dtypes/dtypes.py#L845) and that fails for out-of-bounds values.

Guessing this also is due to going through __from_arrow__ now, but since __from_arrow__ does not have access to the column name I'm guessing this is a pyarrow issue?

Correct.

I will close this issue, as the work to update the code after #52201 needs to be done on the pyarrow side.
