
BUG: Series[int] coerces pyarrow types to plain python scalars (but not numpy) #56021

Open
randolf-scholz opened this issue Nov 17, 2023 · 3 comments
Labels
API Design, Arrow (pyarrow functionality)

Comments

@randolf-scholz
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import pyarrow as pa

s1 = pd.Series([1,2,3], dtype=np.int32)
assert isinstance(s1[0], np.int32)  # ✅
s2 = pd.Series([1,2,3], dtype="int32[pyarrow]")
assert isinstance(s2[0], pa.lib.Int32Scalar)  # ❌ (actually: python int)

Issue Description

When using the numpy backend, __getitem__ returns numpy scalars. When using the pyarrow backend, it returns plain python scalars instead of pyarrow scalars?

Expected Behavior

It's a fundamental expectation of vector data that __getitem__ has a signature like (self: Vector[T], index: int) -> T. So I'd expect to get pyarrow scalars when using the pyarrow backend. In any case, there should be consistent behavior between the different backends.
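
For comparison, a minimal sketch of how a typed pyarrow scalar has to be recovered by hand today (assuming pa.scalar and the ArrowDtype.pyarrow_dtype attribute behave as I expect):

import pandas as pd
import pyarrow as pa

s2 = pd.Series([1, 2, 3], dtype="int32[pyarrow]")

# s2[0] currently comes back as a plain python int, so a typed
# pyarrow scalar has to be reconstructed manually:
item = pa.scalar(s2[0], type=s2.dtype.pyarrow_dtype)
print(type(item))  # <class 'pyarrow.lib.Int32Scalar'>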

Installed Versions

INSTALLED VERSIONS

commit : 2a953cf
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.0-36-generic
Version : #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.3
numpy : 1.26.1
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : 3.0.4
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.10.0
fsspec : 2023.10.0
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 1.4.49
tables : 3.9.1
tabulate : 0.9.0
xarray : 2023.10.1
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@randolf-scholz added the Bug and Needs Triage labels on Nov 17, 2023
@randolf-scholz
Contributor Author

There are also some secondary consequences due to this coercion:

assert (s1 + s1[0]).dtype == s1.dtype  # ✅ still np.int32
assert (s2 + s2[0]).dtype == s2.dtype  # ❌ now int64[pyarrow]
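
The mechanism, as far as I can tell: s2[0] is already a plain int, so the addition behaves exactly like adding an untyped python scalar:

print(type(s2[0]))         # <class 'int'>
print((s2 + s2[0]).dtype)  # int64[pyarrow]
print((s2 + 1).dtype)      # int64[pyarrow], same thing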

@jorisvandenbossche added the API Design and Arrow (pyarrow functionality) labels and removed the Bug and Needs Triage labels on Nov 17, 2023
@jorisvandenbossche
Member

Personally, I do think we should not return pyarrow scalars. In any case, this is not a bug but an API design discussion (the .as_py() call in ArrowExtensionArray.__getitem__ was done on purpose).

That said, the surprising result dtype in the arithmetic example you give could also be seen as a bug to be fixed, because you get the same unnecessary upcast when using a plain python scalar directly:

print((s1 + 1).dtype)   # still np.int32
print((s2 + 1).dtype)   # now int64[pyarrow]

pyarrow doesn't do much inference/promotion on the scalar you pass; it simply coerces it to its default type (int64 in this case) and then promotes the other operand based on that. This could be something pyarrow improves for python scalars (giving priority to the type of the array operand, like numpy does), or something we handle on our side as long as pyarrow doesn't.
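
A minimal sketch at the pyarrow level, assuming pyarrow's current promotion rules, showing the coercion described above: an untyped python int becomes an int64 scalar and drags the result type up, while an explicitly typed scalar preserves the array's type:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3], type=pa.int32())

# A bare python int is coerced to pyarrow's default int64,
# so the addition is promoted to int64.
print(pc.add(arr, 1).type)  # int64

# An explicitly typed int32 scalar keeps the result at int32.
print(pc.add(arr, pa.scalar(1, type=pa.int32())).type)  # int32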

@randolf-scholz
Contributor Author

randolf-scholz commented Nov 17, 2023

Personally I do think we should not return pyarrow scalars.

From a typing perspective, how should the return type of Series.__getitem__ relate to the data type of the series, in your opinion? Personally, I find it incredibly confusing and baroque if every data type gets its own special rules instead of the simple and intuitive __getitem__(self: Series[T], index: int) -> T.

As someone who has contributed to pandas-stubs, I can say that all these special rules make the stubs very complicated and brittle.
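
To make that concrete, a rough typing sketch (hypothetical Vector protocol, not the actual pandas-stubs definitions) of the signature I have in mind:

from typing import Protocol, TypeVar

T_co = TypeVar("T_co", covariant=True)

class Vector(Protocol[T_co]):
    # Indexing with a single integer returns exactly the element type,
    # regardless of the backing storage (numpy, pyarrow, ...).
    def __getitem__(self, index: int) -> T_co: ...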
