
BUG: Series[int] coerces pyarrow types to plain python scalars (but not numpy) #56021

Open
randolf-scholz opened this issue Nov 17, 2023 · 3 comments
Labels
API Design, Arrow (pyarrow functionality)

Comments

@randolf-scholz
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import pyarrow as pa

s1 = pd.Series([1,2,3], dtype=np.int32)
assert isinstance(s1[0], np.int32)  # ✅
s2 = pd.Series([1,2,3], dtype="int32[pyarrow]")
assert isinstance(s2[0], pa.lib.Int32Scalar)  # ❌ (actually: python int)

Issue Description

When using the numpy backend, __getitem__ returns numpy scalars. When using the pyarrow backend, it returns plain python scalars instead of pyarrow scalars?

Expected Behavior

It's a fundamental expectation of vector data that __getitem__ has a signature like (self: Vector[T], index: int) -> T. So I'd expect to get pyarrow scalars when using the pyarrow backend. In any case, there should be consistent behavior between the different backends.
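
For comparison, a minimal sketch of how a typed pyarrow scalar has to be recovered by hand today (assuming pa.scalar and the ArrowDtype.pyarrow_dtype attribute behave as I expect):

import pandas as pd
import pyarrow as pa

s2 = pd.Series([1, 2, 3], dtype="int32[pyarrow]")

# s2[0] currently comes back as a plain python int, so a typed
# pyarrow scalar has to be reconstructed manually:
item = pa.scalar(s2[0], type=s2.dtype.pyarrow_dtype)
print(type(item))  # <class 'pyarrow.lib.Int32Scalar'>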

Installed Versions

INSTALLED VERSIONS

commit : 2a953cf
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.0-36-generic
Version : #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.3
numpy : 1.26.1
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : 3.0.4
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.10.0
fsspec : 2023.10.0
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 1.4.49
tables : 3.9.1
tabulate : 0.9.0
xarray : 2023.10.1
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@randolf-scholz added the Bug and Needs Triage labels on Nov 17, 2023
@randolf-scholz
Contributor Author

There are also some secondary consequences due to this coercion:

assert (s1 + s1[0]).dtype == s1.dtype  # ✅ still np.int32
assert (s2 + s2[0]).dtype == s2.dtype  # ❌ now int64[pyarrow]
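
The mechanism, as far as I can tell: s2[0] is already a plain int, so the addition behaves exactly like adding an untyped python scalar:

print(type(s2[0]))         # <class 'int'>
print((s2 + s2[0]).dtype)  # int64[pyarrow]
print((s2 + 1).dtype)      # int64[pyarrow], same thing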

@jorisvandenbossche added the API Design and Arrow (pyarrow functionality) labels and removed the Bug and Needs Triage labels on Nov 17, 2023
@jorisvandenbossche
Member

Personally, I do think we should not return pyarrow scalars. In any case, this is not a bug but an API design discussion (the .as_py() call in ArrowExtensionArray.__getitem__ was done on purpose).

That said, the surprising result dtype in the arithmetic example you give could also be seen as a bug to be fixed, because you get the same unnecessary upcast when using a plain python scalar directly:

print((s1 + 1).dtype)   # still np.int32
print((s2 + 1).dtype)   # now int64[pyarrow]

pyarrow doesn't do much inference/promotion on the scalar you pass; it simply coerces it to its default type (int64 in this case) and then promotes the other operand based on that. This could be something pyarrow improves for python scalars (giving priority to the type of the array operand, like numpy does), or something we handle on our side as long as pyarrow doesn't.
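
A minimal sketch at the pyarrow level, assuming pyarrow's current promotion rules, showing the coercion described above: an untyped python int becomes an int64 scalar and drags the result type up, while an explicitly typed scalar preserves the array's type:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3], type=pa.int32())

# A bare python int is coerced to pyarrow's default int64,
# so the addition is promoted to int64.
print(pc.add(arr, 1).type)  # int64

# An explicitly typed int32 scalar keeps the result at int32.
print(pc.add(arr, pa.scalar(1, type=pa.int32())).type)  # int32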

@randolf-scholz
Contributor Author

randolf-scholz commented Nov 17, 2023

Personally I do think we should not return pyarrow scalars.

From a typing perspective, how should the return type of Series.__getitem__ relate to the data type of the series, in your opinion? Personally, I find it incredibly confusing and baroque if every data type gets its own special rules instead of the simple and intuitive __getitem__(self: Series[T], index: int) -> T.

As someone who has contributed to pandas-stubs, I can say that all these special rules make the stubs very complicated and brittle.
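
To make that concrete, a rough typing sketch (hypothetical Vector protocol, not the actual pandas-stubs definitions) of the signature I have in mind:

from typing import Protocol, TypeVar

T_co = TypeVar("T_co", covariant=True)

class Vector(Protocol[T_co]):
    # Indexing with a single integer returns exactly the element type,
    # regardless of the backing storage (numpy, pyarrow, ...).
    def __getitem__(self, index: int) -> T_co: ...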
