
BUG: python crashes on filtering with .loc on boolean Series with dtype_backend=pyarrow on some dataframes. #52059


Closed
3 tasks done
pstorozenko opened this issue Mar 17, 2023 · 8 comments · Fixed by #52075
Labels
Arrow pyarrow functionality Bug Segfault Non-Recoverable Error

Comments


pstorozenko commented Mar 17, 2023

This bug is related to pandas 2.0. On 1.5.3 (with numpy as dtype backend) everything works.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Using parquet from wiki100.zip.

import pandas as pd

wiki = pd.read_parquet("wiki100.parquet", engine="pyarrow", dtype_backend="pyarrow")

wiki.loc[wiki['curr'] == "Warsaw", :]

Issue Description

After running such script I get

➜  python 02_pandas20.py
[1]    24955 floating point exception (core dumped)  python 02_pandas20.py

Expected Behavior

I should get the result of this query:

                                                       prev    curr  type    n
13488247                                            Trumpet  Warsaw  link   10
13488399                                Bronislava_Nijinska  Warsaw  link   33
13488166  List_of_European_cities_by_population_within_c...  Warsaw  link  480
13488365                               Warsaw_pogrom_(1881)  Warsaw  link   10
13488408                                     Witold_Pilecki  Warsaw  link   19

Installed Versions

INSTALLED VERSIONS

commit : 23c3dc2
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-35-generic
Version : #36~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 17 15:17:25 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+246.g23c3dc2c37
numpy : 1.25.0.dev0+918.g28bce82c8
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None

Context

The file is a subset of wiki clickstream data.

This code works well if I don't set dtype_backend="pyarrow".
Tested on both 2.0.0rc1 and the nightly build.

The operation wiki['curr'] == "Warsaw" executes correctly on its own, so the issue lies in filtering with the resulting boolean array.
Sorry for not providing a simpler example; all the handcrafted ones I tried worked every time.

@pstorozenko pstorozenko added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 17, 2023
@pstorozenko pstorozenko changed the title from "BUG:" to "BUG: python crashes on filtering with .loc on boolean Series with dtype_backend=pyarrow on some dataframes." Mar 17, 2023

eloyfelix commented Mar 18, 2023

I have found what I believe is the same bug.

Tested it in 2.0 RC1 and in master branch 2.1.0.dev0+265.gd8d1a474c7

Using a single string column csv file bug.csv.gz

import pandas as pd
df = pd.read_csv("bug.csv", dtype={"data_validity_comment": "string[pyarrow]"})
outside = df[df["data_validity_comment"] == "Outside typical range"]
# python crashes

Dataframe shape is (145000, 1). It stops crashing if I cut it to exactly 143116 rows.

import pandas as pd
df = pd.read_csv("bug.csv", dtype={"data_validity_comment": "string[pyarrow]"})
df = df[0:143116].copy()
outside = df[df["data_validity_comment"] == "Outside typical range"]

This is also the simplest example I could come up with...


phofl commented Mar 18, 2023

cc @mroeschke

Looks like this is coming from pc.replace_with_mask(values, pa.array(mask), replacements) in _replace_with_mask. There is a comment that this caused segfaults for earlier versions, e.g. pyarrow < 8. Looks like it is not completely solved. Thoughts here?

The following reproduces for me:

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

x = pa.scalar(False, type=pa.bool_(), from_pandas=True)
arr = pa.chunked_array([np.array([True] * 5)])
mask = pa.array([False] * 5)
pc.replace_with_mask(arr, mask, pa.array([False] * 5))

@phofl phofl added Segfault Non-Recoverable Error Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 18, 2023
@lukemanley

Calling arr.combine_chunks() before passing to pc.replace_with_mask seems to work. Probably a good idea to report this upstream, since it is still happening with the latest version of pyarrow.

@pstorozenko

Unfortunately I have never used pyarrow directly; could you report it, since you know much better what is actually broken here?

@lukemanley

I opened apache/arrow#34634 upstream in the arrow repo.


char101 commented Mar 22, 2023

I also experienced this; a pyarrow-backed series contains <NA> in the comparison result.

pd.Series([1, pd.NA, 2]) > 0
Out[24]: 
0     True
1    False
2     True
dtype: bool

pd.Series([1, pd.NA, 2], dtype='int32[pyarrow]') > 0
Out[25]: 
0    True
1    <NA>
2    True
dtype: bool[pyarrow]


pstorozenko commented Mar 22, 2023

@char101
I think that's by design and how it should work. The problem with numpy arrays is that they cannot differentiate between NA and NaN for floats, and have no NA at all for ints (so int arrays with NA are converted to floats, as in your case).
With the arrow backend we can finally differentiate between NA and NaN for floats, and introduce a 'proper' NA for other types, like ints here.
You don't know whether NA is greater or smaller than 0, so you get NA in the result.


char101 commented Mar 23, 2023

@pstorozenko Thanks, I understand. The reason I thought it was a bug is that the first <NA> from the result of diff() crashed python, as I wrote in #52122. The difference is that it crashed with 517 elements rather than 145000 as in this issue.

Edit: Adding values.combine_chunks() as in the pull request by phofl fixes it. I was using the nightly wheel of pandas and thought the pull request had already been merged, but when I checked again it was still open.
