BUG: pd.to_numeric(.., dtype_backend='pyarrow') crashes for string[pyarrow] #52588

Closed
2 of 3 tasks
mplatzer opened this issue Apr 11, 2023 · 1 comment · Fixed by #52606
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@mplatzer

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({'x': ['1', '2', 'x']}).to_csv('test.csv')
df = pd.read_csv('test.csv', engine='pyarrow', dtype_backend='pyarrow')
# this works
pd.to_numeric(df['x'], errors='coerce')
# this works
pd.to_numeric(df['x'].astype('str'), errors='coerce', dtype_backend='pyarrow')
# this crashes
pd.to_numeric(df['x'], errors='coerce', dtype_backend='pyarrow')

Issue Description

The call to to_numeric crashes with the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[385], line 9
      7 pd.to_numeric(df['x'].astype('str'), errors='coerce', dtype_backend='pyarrow')
      8 # this crashes
----> 9 pd.to_numeric(df['x'], errors='coerce', dtype_backend='pyarrow')

File ~/miniconda3/envs/mostly-data/lib/python3.9/site-packages/pandas/core/tools/numeric.py:279, in to_numeric(arg, errors, downcast, dtype_backend)
    277 assert isinstance(mask, np.ndarray)
    278 data = np.zeros(mask.shape, dtype=values.dtype)
--> 279 data[~mask] = values
    281 from pandas.core.arrays import (
    282     ArrowExtensionArray,
    283     BooleanArray,
    284     FloatingArray,
    285     IntegerArray,
    286 )
    288 klass: type[IntegerArray] | type[BooleanArray] | type[FloatingArray]

ValueError: NumPy boolean array indexing assignment cannot assign 2 input values to the 3 output values where the mask is true
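For readers unfamiliar with this message, the following is a minimal NumPy-only sketch of the failing pattern shown in the traceback (`data[~mask] = values`). It is not taken from the pandas source beyond those lines; the function name and the concrete arrays are made up for illustration, and it assumes only that `mask` has three entries, none of them set, while `values` holds two elements.

```python
import numpy as np


def reproduce_mask_mismatch():
    """Return the ValueError message raised by a mismatched mask assignment."""
    # Hypothetical inputs: three rows, none flagged as missing in the mask,
    # but only two converted values available to fill them.
    mask = np.array([False, False, False])
    values = np.array([1.0, 2.0])

    data = np.zeros(mask.shape, dtype=values.dtype)
    try:
        # ~mask selects three positions, values supplies only two.
        data[~mask] = values
    except ValueError as exc:
        return str(exc)
    return None


message = reproduce_mask_mismatch()
print(message)
```

The assignment only succeeds when the number of `True` entries in `~mask` equals the length of `values`, so the error suggests that the mask and the converted values disagree about how many rows are missing.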

Expected Behavior

No crash, and same output as for pd.to_numeric(df['x'], errors='coerce')
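As a rough illustration of the expected values, here is a small sketch that runs `errors='coerce'` on a plain object-dtype Series. This is a stand-in that avoids the CSV round trip and pyarrow entirely, so the resulting dtype may differ from the pyarrow-backed case in the report; the helper name is hypothetical.

```python
import pandas as pd


def coerce_to_numeric(values):
    """Convert strings to numbers, turning unparseable entries into missing values."""
    # Object dtype is used on purpose so that no pyarrow install is required.
    series = pd.Series(values, dtype=object)
    return pd.to_numeric(series, errors="coerce")


result = coerce_to_numeric(["1", "2", "x"])
print(result)
```

The first two entries are parsed as numbers and the unparseable 'x' becomes a missing value, which is the behaviour the crashing call should also produce.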

Installed Versions

INSTALLED VERSIONS
------------------
commit : 478d340
python : 3.9.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.3.0
Version : Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.1
bottleneck : None
brotli : None
fastparquet : 0.8.3
fsspec : 2023.3.0
gcsfs : 2023.3.0
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.9
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@mplatzer added the Bug and Needs Triage labels on Apr 11, 2023
@phofl added the Dtype Conversions label and removed the Needs Triage label on Apr 11, 2023
@phofl added this to the 2.0.1 milestone on Apr 11, 2023
@DOH-EKP0303

Hello. Although this particular bug was fixed, the same error as described above comes back as soon as I insert a null value into the data frame:

Reproducible Example

import numpy as np
import pandas as pd
pd.DataFrame({'x': ['1', '2', 'x', np.nan]}).to_csv('test.csv')
df = pd.read_csv('test.csv', engine='pyarrow', dtype_backend='pyarrow')
# this works
pd.to_numeric(df['x'], errors='coerce')
# this works
pd.to_numeric(df['x'].astype('str'), errors='coerce', dtype_backend='pyarrow')
# this crashes
pd.to_numeric(df['x'], errors='coerce', dtype_backend='pyarrow')

Issue Description

The call to to_numeric crashes with the following error:

ValueError                                Traceback (most recent call last)
Cell In[157], line 9
      7 pd.to_numeric(df['x'].astype('str'), errors='coerce', dtype_backend='pyarrow')
      8 # this crashes
----> 9 pd.to_numeric(df['x'], errors='coerce', dtype_backend='pyarrow')

File ~\Miniconda3\envs\transform24\lib\site-packages\pandas\core\tools\numeric.py:297, in to_numeric(arg, errors, downcast, dtype_backend)
    295 assert isinstance(mask, np.ndarray)
    296 data = np.zeros(mask.shape, dtype=values.dtype)
--> 297 data[~mask] = values
    299 from pandas.core.arrays import (
    300     ArrowExtensionArray,
    301     BooleanArray,
    302     FloatingArray,
    303     IntegerArray,
    304 )
    306 klass: type[IntegerArray | BooleanArray | FloatingArray]

ValueError: NumPy boolean array indexing assignment cannot assign 2 input values to the 3 output values where the mask is true

Expected Behavior

No crash, and the same output as for pd.to_numeric(df['x'], errors='coerce'), except that non-numeric values are coerced to the pyarrow null type instead of the numpy null type.

Installed Versions

Pandas = 2.2.1
Pyarrow = 14.0.2
