
BUG: python crashes on filtering with .loc on boolean Series with dtype_backend=pyarrow on some dataframes. #52059


Closed
3 tasks done
pstorozenko opened this issue Mar 17, 2023 · 8 comments · Fixed by #52075
Labels
Arrow pyarrow functionality Bug Segfault Non-Recoverable Error

Comments


pstorozenko commented Mar 17, 2023

This bug is related to pandas 2.0. On 1.5.3 (with numpy as dtype backend) everything works.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Using parquet from wiki100.zip.

import pandas as pd

wiki = pd.read_parquet("wiki100.parquet", engine="pyarrow", dtype_backend="pyarrow")

wiki.loc[wiki['curr'] == "Warsaw", :]

Issue Description

After running such script I get

➜  python 02_pandas20.py
[1]    24955 floating point exception (core dumped)  python 02_pandas20.py

Expected Behavior

I should get the result of this query:

                                                       prev    curr  type    n
13488247                                            Trumpet  Warsaw  link   10
13488399                                Bronislava_Nijinska  Warsaw  link   33
13488166  List_of_European_cities_by_population_within_c...  Warsaw  link  480
13488365                               Warsaw_pogrom_(1881)  Warsaw  link   10
13488408                                     Witold_Pilecki  Warsaw  link   19

Installed Versions

INSTALLED VERSIONS

commit : 23c3dc2
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-35-generic
Version : #36~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 17 15:17:25 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0.dev0+246.g23c3dc2c37
numpy : 1.25.0.dev0+918.g28bce82c8
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None

Context

The file is a subset of wiki clickstream data.

This code works well if I don't set dtype_backend="pyarrow".
Tested on both 2.0.0rc1 and the nightly build.

The operation wiki['curr'] == "Warsaw" executes correctly on its own, so the issue lies in filtering with the resulting boolean array.
Sorry for not providing a simpler example; all the handcrafted ones I tried worked every time.

@pstorozenko pstorozenko added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 17, 2023
@pstorozenko pstorozenko changed the title from "BUG:" to "BUG: python crashes on filtering with .loc on boolean Series with dtype_backend=pyarrow on some dataframes." Mar 17, 2023

eloyfelix commented Mar 18, 2023

I have found what I believe is the same bug.

Tested it in 2.0 RC1 and in master branch 2.1.0.dev0+265.gd8d1a474c7

Using a single string column csv file bug.csv.gz

import pandas as pd
df = pd.read_csv("bug.csv", dtype={"data_validity_comment": "string[pyarrow]"})
outside = df[df["data_validity_comment"] == "Outside typical range"]
# python crashes

Dataframe shape is (145000, 1). It stops crashing if I cut it to exactly 143116 rows.

import pandas as pd
df = pd.read_csv("bug.csv", dtype={"data_validity_comment": "string[pyarrow]"})
df = df[0:143116].copy()
outside = df[df["data_validity_comment"] == "Outside typical range"]

This is also the simplest example I could come up with...


phofl commented Mar 18, 2023

cc @mroeschke

Looks like this is coming from pc.replace_with_mask(values, pa.array(mask), replacements) in _replace_with_mask. There is a comment that this caused segfaults for earlier versions, e.g. pyarrow < 8. Looks like it is not completely solved. Thoughts here?

The following reproduces for me:

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

x = pa.scalar(False, type=pa.bool_(), from_pandas=True)
arr = pa.chunked_array([np.array([True] * 5)])
mask = pa.array([False] * 5)
pc.replace_with_mask(arr, mask, pa.array([False] * 5))

@phofl phofl added Segfault Non-Recoverable Error Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 18, 2023
@lukemanley

Calling arr.combine_chunks() before passing to pc.replace_with_mask seems to work. Probably a good idea to report this upstream, since it is still happening with the latest version of pyarrow.

@pstorozenko

Unfortunately I have never used pyarrow directly; could you report it, since you know much better what is actually broken here?

@lukemanley

I opened apache/arrow#34634 upstream in the arrow repo.


char101 commented Mar 22, 2023

I also experienced this; a pyarrow-backed series contains <NA> in the comparison result.

pd.Series([1, pd.NA, 2]) > 0
Out[24]: 
0     True
1    False
2     True
dtype: bool

pd.Series([1, pd.NA, 2], dtype='int32[pyarrow]') > 0
Out[25]: 
0    True
1    <NA>
2    True
dtype: bool[pyarrow]


pstorozenko commented Mar 22, 2023

@char101
I think that's by design and how it should work. The problem with numpy arrays is that they cannot differentiate between NA and NaN for floats, and have no NA at all for ints (so int arrays with NA are converted to floats, as in your case).
With the arrow backend we can finally differentiate between NA and NaN for floats, and introduce a 'proper' NA for other types, like ints here.
You don't know whether NA is greater or smaller than 0, so you get NA in the result.


char101 commented Mar 23, 2023

@pstorozenko Thanks, I understand. The reason I thought it was a bug is that the first <NA> from the result of diff() crashed python, as I wrote in #52122. The difference is that it crashed with 517 elements rather than 145000 as in this issue.

Edit: Adding values.combine_chunks() as in the pull request by phofl fixes it. I was using the nightly wheel of pandas and thought the pull request had already been merged, but when I checked again it was still open.
