BUG: regression in read_parquet that raises a pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays #55606

Closed
arnaudlegout opened this issue Oct 20, 2023 · 6 comments · Fixed by #55691
Labels
IO Parquet (parquet, feather) · Regression (functionality that used to work in a prior pandas version) · Strings (string extension data type and string data)
Milestone

Comments

@arnaudlegout
Contributor

arnaudlegout commented Oct 20, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# all_tx_info is the path to the 71 GB Parquet file described below
pd.read_parquet(all_tx_info)

The file I am using is 71 GB, so I cannot share it easily. If the error is clear from the stack trace, that is fine; if not, I can spend time trying to reproduce this issue with a synthetic DataFrame.

Issue Description

I have the following DataFrame stored as Parquet on disk:

>>> tx.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 904346808 entries, 0 to 804503
Data columns (total 11 columns):
 #   Column              Dtype
---  ------              -----
 0   tx_hash             string
 1   block_height        Int64
 2   block_date          datetime64[ns]
 3   tx_version          Int64
 4   is_coinbase         boolean
 5   is_segwit           boolean
 6   total_tx_in_value   Int64
 7   total_tx_out_value  Int64
 8   tx_fee              Int64
 9   nb_in               Int64
 10  nb_out              Int64
dtypes: Int64(7), boolean(2), datetime64[ns](1), string(1)
memory usage: 171.8 GB

With pandas 2.0.3, when I load it using read_parquet, it works without any issue.

However, with pandas 2.1.1, I have the following exception

pd.read_parquet(all_tx_info)
Traceback (most recent call last):
  File "/home/alegout/bc_tools.py", line 694, in load_all_tx_info_to_dataframe
    return pd.read_parquet(all_tx_info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/io/parquet.py", line 670, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/io/parquet.py", line 279, in read
    result = pa_table.to_pandas(**to_pandas_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 3990, in pyarrow.lib.Table._to_pandas
  File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 820, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1171, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1171, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 780, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/core/arrays/string_.py", line 229, in __from_arrow__
    arr = pyarrow.concat_arrays(chunks).to_numpy(zero_copy_only=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 3039, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Expected Behavior

The DataFrame loads from the Parquet file without raising an exception.

Installed Versions

Here are the installed versions for the environment that works with pandas 2.0.3:

INSTALLED VERSIONS
------------------
commit : 0f43794
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.6-200.fc38.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:02:35 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.0
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

And here are the installed versions for the environment with pandas 2.1.1 that does not work:

INSTALLED VERSIONS
------------------
commit : e86ed37
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.6-200.fc38.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:02:35 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.1
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : 3.0.0
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : 0.10.0
bs4 : 4.12.2
bottleneck : 1.3.5
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.2
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.0
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.21
tables : None
tabulate : 0.8.10
xarray : 2023.6.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None

@arnaudlegout added the Bug and Needs Triage labels on Oct 20, 2023
@harningle

I have the same issue when working on data from PatentsView. You can find my Parquet file here.

@jorisvandenbossche added the Regression, Strings, and IO Parquet labels and removed the Bug and Needs Triage labels on Oct 20, 2023
@jorisvandenbossche added this to the 2.1.2 milestone on Oct 20, 2023
@jorisvandenbossche
Member

Thanks for the report!

I assume this is related to #47781 (but didn't actually verify, just from looking at the error) cc @timlod

What is happening here is that reading the Parquet file with pyarrow results in a pyarrow.Table with a string column that consists of multiple chunks. The mentioned PR changed our conversion to first concatenate those chunks on the pyarrow side, and only then convert to a numpy array and put it in a pandas string column. But the way the Arrow string memory layout works is with an array of offsets into a buffer of all the string characters. For the default pyarrow string type, those offsets are int32, and then concatenating is not always possible for large arrays. Arrow does have a large_string type with int64 offsets that avoids this problem, but generally just keeping the string type chunked is fine. In this case, however, we are explicitly concatenating the chunks.

I don't know if pyarrow gives an easy API to automatically fall back to the large_string type when needed. In theory, we could check the size of the buffers for all chunks to verify that we can concatenate them, and otherwise fall back to the original code.
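
For illustration (not part of the original report), a minimal self-contained sketch of this failure mode; the array sizes are chosen only to push the combined string data past the int32 offset limit, and running it needs several GB of RAM:

import pyarrow as pa

# a single chunk of ~1.5 GB of string data fits comfortably in int32 offsets...
chunk = pa.array(['x' * 50] * 30_000_000, type=pa.string())
# ...but two such chunks together hold ~3 GB of characters, more than 2**31 - 1 bytes
chunked = pa.chunked_array([chunk, chunk])

# keeping the data chunked is fine; flattening it into one string array is not
try:
    pa.concat_arrays(chunked.chunks)
except pa.ArrowInvalid as exc:
    print(exc)  # offset overflow while concatenating arrays

# casting each chunk to large_string (int64 offsets) before concatenating avoids the overflow
pa.concat_arrays([c.cast(pa.large_string()) for c in chunked.chunks])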

@arnaudlegout
Contributor Author

arnaudlegout commented Oct 20, 2023

Here is a reproducible example raising the exception:

import pandas as pd

n = 50_000_000

# 50 million rows of a 43-character string stored in a 'string' dtype column
df = pd.DataFrame({'a': '83a2a0d9ddb87af7a929f09db9bae8c2b965ffb6d3d'}, index=range(n), dtype='string')
df.to_parquet('test.parquet')

# here is the line that raises the exception
pd.read_parquet('test.parquet')

From my tests, if you reduce n (e.g., to 30_000_000), there is no exception. If you reduce the string size (e.g., to '83a2a0d9ddb87af7a929f09db9b'), there is no exception either.

@arnaudlegout
Contributor Author

If I got it correctly, in Arrow we have an array of bytes and an array of offsets defining the string boundaries in that byte array. As the offsets are int32 (not even uint32?), if the total length of all strings exceeds the int32 limit of 2**31 - 1 bytes (2,147,483,647), then we have the crash.

This is consistent with my tests.
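
For illustration, a back-of-the-envelope check of the reproducible example above against this explanation (the offsets count bytes of UTF-8 data, which for these ASCII strings equals the character count):

strlen = len('83a2a0d9ddb87af7a929f09db9bae8c2b965ffb6d3d')    # 43 characters
print(50_000_000 * strlen)                              # 2_150_000_000 > 2**31 - 1 = 2_147_483_647 -> overflow
print(30_000_000 * strlen)                              # 1_290_000_000 -> fits in int32 offsets, no exception
print(50_000_000 * len('83a2a0d9ddb87af7a929f09db9b'))  # 27 characters -> 1_350_000_000 -> fits, no exception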

@timlod
Contributor

timlod commented Oct 23, 2023

> Thanks for the report!
>
> I assume this is related to #47781 (but didn't actually verify, just from looking at the error) cc @timlod
>
> What is happening here is that reading the Parquet file with pyarrow results in a pyarrow.Table with a string column that consists of multiple chunks. The mentioned PR changed our conversion to first concatenate those chunks on the pyarrow side, and only then convert to a numpy array and put it in a pandas string column. But the way the Arrow string memory layout works is with an array of offsets into a buffer of all the string characters. For the default pyarrow string type, those offsets are int32, and then concatenating is not always possible for large arrays. Arrow does have a large_string type with int64 offsets that avoids this problem, but generally just keeping the string type chunked is fine. In this case, however, we are explicitly concatenating the chunks.
>
> I don't know if pyarrow gives an easy API to automatically fall back to the large_string type when needed. In theory, we could check the size of the buffers for all chunks to verify that we can concatenate them, and otherwise fall back to the original code.

Hi, I think your analysis is correct!
I haven't worked on this in a while due to changing jobs, so I'm unfortunately rusty on Arrow.
Based on a quick check, it doesn't appear there's existing logic for this in Arrow that we can just use.
So I think your size-checking suggestion would be a fix for this - we could then use cast to do it (not sure of the performance impact):

chunks = [chunk.cast(pa.large_string(), safe=False) for chunk in chunks]

There might be a more efficient way to concatenate in this case, but to be honest I would probably consider this a bug in pyarrow, and something that should be fixed upstream, or at least documented: https://arrow.apache.org/docs/python/generated/pyarrow.chunked_array.html
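
For illustration, a rough sketch of the size check plus cast fallback discussed above (not the change that ended up in pandas; the function name is made up):

import pyarrow as pa

def concat_string_chunks(chunks):
    # chunks: the list of string arrays from a pyarrow ChunkedArray.
    # nbytes also counts the offsets and validity buffers, so this is a
    # conservative upper bound on the size of the concatenated string data.
    if sum(chunk.nbytes for chunk in chunks) >= 2**31:
        chunks = [chunk.cast(pa.large_string(), safe=False) for chunk in chunks]
    return pa.concat_arrays(chunks)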

@jorisvandenbossche
Member

In this case we could indeed cast to large_string, since we convert to numpy anyway and end up with an object dtype numpy array (for our arrow-backed StringDtype, we do store it as string, so casting to large_string as an intermediate step wouldn't help).

But while looking into it, I noticed we could actually make use of the faster methods from your original PR (using ensure_string_array, bypassing the validation in the StringArray init), while still doing the arrow->numpy conversion chunk by chunk.
And that's actually even faster, because concatenating the resulting numpy object-dtype arrays is cheaper than concatenating the pyarrow arrays first (before converting to numpy).

I put that in a PR -> #55691
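
For illustration, a minimal sketch of the chunk-by-chunk conversion described in this comment (the actual change is in #55691; the helper name here is made up):

import numpy as np
import pyarrow as pa

def chunked_strings_to_object_ndarray(arr: pa.ChunkedArray) -> np.ndarray:
    # convert each chunk to a numpy object array on its own, so pyarrow never has to
    # build a single string array whose int32 offsets could overflow, and concatenate
    # the per-chunk results on the numpy side instead
    parts = [chunk.to_numpy(zero_copy_only=False) for chunk in arr.chunks]
    if not parts:
        return np.array([], dtype=object)
    return np.concatenate(parts)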
