BUG: regression in read_parquet that raises a pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
#55606
Comments
I have the same issue when working on data from PatentsView. You can find my parquet file here.
Thanks for the report! I assume this is related to #47781 (but didn't actually verify, just from looking at the error), cc @timlod.

What is happening here is that reading the Parquet file with pyarrow results in a pyarrow.Table with a string column that consists of multiple chunks. The mentioned PR changed our conversion to first concatenate those chunks on the pyarrow side, and only then convert to a numpy array and put it in a pandas DataFrame.

I don't know if pyarrow gives an easy API to automatically fall back to the large_string type when needed. In theory we could check the size of the buffers for all chunks to verify that we can concatenate the chunks, and otherwise fall back to the original code.
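For illustration, a rough sketch of that buffer-size check (the `needs_large_string` helper name and the threshold handling are assumptions, not pandas code):

```python
import pyarrow as pa

def needs_large_string(chunked: pa.ChunkedArray) -> bool:
    # For a string array, Array.buffers() returns
    # [validity bitmap, int32 offsets, data]. If the combined data
    # buffers exceed the int32 offset range, concatenating the chunks
    # as plain string would overflow.
    total = 0
    for chunk in chunked.chunks:
        data_buf = chunk.buffers()[2]
        if data_buf is not None:
            total += data_buf.size
    return total > 2**31 - 1
```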
Here is a reproducible example raising the exception:
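(A sketch reconstructing the example; the exact `n` and string value are assumptions. What matters is that `n` times the string length exceeds 2**31 bytes.)

```python
import pandas as pd

n = 100_000_000
s = "abcdefghijklmnopqrstuvwxyz"  # 26 bytes each -> ~2.6 GB of string data

df = pd.DataFrame({"col": [s] * n})
df.to_parquet("big_strings.parquet")

# With pandas 2.1.1 this raises:
# pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
pd.read_parquet("big_strings.parquet")
```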
From my test, if you reduce n (e.g., to 30_000_000), there is no exception.
If I got it correctly, in Arrow we have an array of bytes and an array of offsets defining string boundaries in the array of bytes. As the array of offsets is int32 (not even uint32?), if the sum of the lengths of all strings exceeds 2147483647 (the int32 maximum), we get the crash. This is consistent with my tests.
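That mechanism can be shown with pyarrow directly, without Parquet at all (a sketch; the sizes are chosen just to cross the limit, and running it allocates a few GiB):

```python
import pyarrow as pa

# Each chunk holds ~1 GiB of string data and is valid on its own
# (int32 offsets).
chunk = pa.array(["x" * 1024] * (1024 * 1024))

# Together the chunks hold ~3 GiB, past the int32 offset limit, so
# flattening them into a single string array fails.
chunked = pa.chunked_array([chunk, chunk, chunk])
chunked.combine_chunks()
# pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
```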
Hi, I think your analysis is correct! Casting the chunks first, e.g. `chunks = [chunk.cast(pa.large_string(), safe=False) for chunk in chunks]`, avoids the overflow. There might be a more efficient way to concatenate in this case, but to be honest, I would probably consider this a bug in pyarrow, and something that should be fixed upstream, or at least documented: https://arrow.apache.org/docs/python/generated/pyarrow.chunked_array.html
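As a user-side workaround in the same spirit (a sketch; the file name is a placeholder): read the file with pyarrow directly, widen plain string columns to `large_string` (int64 offsets), then convert to pandas.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")

# Replace every plain string field with large_string, then cast the
# whole table before handing it to pandas.
fields = [
    pa.field(f.name, pa.large_string()) if f.type == pa.string() else f
    for f in table.schema
]
df = table.cast(pa.schema(fields)).to_pandas()
```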
In this case we could indeed cast to `large_string` first. But while looking into it, I noticed we could actually make use of the faster methods of your original PR. Put that in a PR -> #55691
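For illustration, a minimal sketch of that fallback idea (the `combine_string_chunks` helper is hypothetical, not the actual implementation in #55691): try the cheap concatenation first and only widen the offsets when it overflows.

```python
import pyarrow as pa

def combine_string_chunks(chunked: pa.ChunkedArray) -> pa.Array:
    # Fast path: flatten the chunks with int32 offsets.
    try:
        return chunked.combine_chunks()
    except pa.ArrowInvalid:
        # Offsets would overflow; widen to int64 offsets and retry.
        return chunked.cast(pa.large_string()).combine_chunks()
```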
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
The file I am using is 71GB, so I cannot share it easily. If the error is clear from the stack trace, that is fine. If it is not clear, I can spend time trying to reproduce this issue with a synthetic DataFrame.
Issue Description
I have a Parquet file stored on disk. With pandas 2.0.3, when I load it using `read_parquet`, it works without any issue. However, with pandas 2.1.1, I get the following exception:

pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
Expected Behavior
The DataFrame loads from the Parquet file without an exception.
Installed Versions
Here are the installed versions for the environment that works with pandas 2.0.3:
pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.0
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
And here is the environment with pandas 2.1.1 that does not work:
pandas : 2.1.1
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : 3.0.0
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : 0.10.0
bs4 : 4.12.2
bottleneck : 1.3.5
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.2
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.0
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.21
tables : None
tabulate : 0.8.10
xarray : 2023.6.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None