Skip to content

BUG: Passing pyarrow string array + dtype to pd.Series throws ArrowInvalidError on 2.0rc #51844

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
ivirshup opened this issue Mar 8, 2023 · 0 comments · Fixed by #52076
Closed
2 of 3 tasks
Labels
Arrow pyarrow functionality Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data

Comments

@ivirshup
Copy link
Contributor

ivirshup commented Mar 8, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd, pyarrow as pa

pd.Series(pa.array("the quick brown fox".split()), dtype="string")

# ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

Issue Description

The example above errors when it probably shouldn't. Here's the example + traceback:

import pandas as pd, pyarrow as pa

pd.Series(pa.array("the quick brown fox".split()), dtype="string")
# ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
traceback
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[1], line 3
      1 import pandas as pd, pyarrow as pa
----> 3 pd.Series(pa.array("the quick brown fox".split()), dtype="string")

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/core/series.py:490, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
    488         data = data.copy()
    489 else:
--> 490     data = sanitize_array(data, index, dtype, copy)
    492     manager = get_option("mode.data_manager")
    493     if manager == "block":

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/core/construction.py:565, in sanitize_array(data, index, dtype, copy, allow_2d)
    563     _sanitize_non_ordered(data)
    564     cls = dtype.construct_array_type()
--> 565     subarr = cls._from_sequence(data, dtype=dtype, copy=copy)
    567 # GH#846
    568 elif isinstance(data, np.ndarray):

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/core/arrays/string_.py:359, in StringArray._from_sequence(cls, scalars, dtype, copy)
    355     result[na_values] = libmissing.NA
    357 else:
    358     # convert non-na-likes to str, and nan-likes to StringDtype().na_value
--> 359     result = lib.ensure_string_array(scalars, na_value=libmissing.NA, copy=copy)
    361 # Manually creating new array avoids the validation step in the __init__, so is
    362 # faster. Refactor need for validation?
    363 new_string_array = cls.__new__(cls)

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/_libs/lib.pyx:712, in pandas._libs.lib.ensure_string_array()

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/_libs/lib.pyx:754, in pandas._libs.lib.ensure_string_array()

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pyarrow/array.pxi:1475, in pyarrow.lib.Array.to_numpy()

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

This also occurs if dtype="string[pyarrow]":

pd.Series(pa.array("the quick brown fox".split()), dtype="string[pyarrow]")
# ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
traceback
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[2], line 3
      1 import pandas as pd, pyarrow as pa
----> 3 pd.Series(pa.array("the quick brown fox".split()), dtype="string[pyarrow]")

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/core/series.py:490, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
    488         data = data.copy()
    489 else:
--> 490     data = sanitize_array(data, index, dtype, copy)
    492     manager = get_option("mode.data_manager")
    493     if manager == "block":

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/core/construction.py:565, in sanitize_array(data, index, dtype, copy, allow_2d)
    563     _sanitize_non_ordered(data)
    564     cls = dtype.construct_array_type()
--> 565     subarr = cls._from_sequence(data, dtype=dtype, copy=copy)
    567 # GH#846
    568 elif isinstance(data, np.ndarray):

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/core/arrays/string_arrow.py:149, in ArrowStringArray._from_sequence(cls, scalars, dtype, copy)
    146     return cls(pa.array(result, mask=na_values, type=pa.string()))
    148 # convert non-na-likes to str
--> 149 result = lib.ensure_string_array(scalars, copy=copy)
    150 return cls(pa.array(result, type=pa.string(), from_pandas=True))

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/_libs/lib.pyx:712, in pandas._libs.lib.ensure_string_array()

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pandas/_libs/lib.pyx:754, in pandas._libs.lib.ensure_string_array()

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pyarrow/array.pxi:1475, in pyarrow.lib.Array.to_numpy()

File ~/miniconda3/envs/pandas-2.0/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

Expected Behavior

This probably shouldn't error, and instead result in a series of the appropriate dtype.

I would note that this seems to work on 1.5.3

Installed Versions

INSTALLED VERSIONS

commit : 1a2e300
python : 3.10.9.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Tue Jun 21 20:50:28 PDT 2022; root:xnu-7195.141.32~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.0rc0
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.0.1
Cython : None
pytest : 7.2.1
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : 1.3.7rc1
brotli : None
fastparquet : None
fsspec : 2023.1.0
gcsfs : None
matplotlib : 3.7.0
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.1
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.1.0
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

@ivirshup ivirshup added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 8, 2023
@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data Arrow pyarrow functionality and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants