Commit 05f2f71

REGR: fix read_parquet with column of large strings (avoid overflow from concat) (#55691)
* REGR: fix read_parquet with column of large strings (avoid overflow from concat)
* comment out test
* add comment
1 parent 8afd868 commit 05f2f71

3 files changed: +23 -2 lines changed

doc/source/whatsnew/v2.1.2.rst (+1)

@@ -28,6 +28,7 @@ Fixed regressions
 - Fixed regression in :meth:`DataFrameGroupBy.agg` and :meth:`SeriesGroupBy.agg` where if the option ``compute.use_numba`` was set to True, groupby methods not supported by the numba engine would raise a ``TypeError`` (:issue:`55520`)
 - Fixed performance regression with wide DataFrames, typically involving methods where all columns were accessed individually (:issue:`55256`, :issue:`55245`)
 - Fixed regression in :func:`merge_asof` raising ``TypeError`` for ``by`` with datetime and timedelta dtypes (:issue:`55453`)
+- Fixed regression in :func:`read_parquet` when reading a file with a string column consisting of more than 2 GB of string data and using the ``"string"`` dtype (:issue:`55606`)
 - Fixed regression in :meth:`DataFrame.to_sql` not roundtripping datetime columns correctly for sqlite when using ``detect_types`` (:issue:`55554`)
 - Fixed regression in construction of certain DataFrame or Series subclasses (:issue:`54922`)

pandas/core/arrays/string_.py (+10 -2)

@@ -228,11 +228,19 @@ def __from_arrow__(
                 # pyarrow.ChunkedArray
                 chunks = array.chunks
 
+            results = []
+            for arr in chunks:
+                # convert chunk by chunk to numpy and concatenate then, to avoid
+                # overflow for large string data when concatenating the pyarrow arrays
+                arr = arr.to_numpy(zero_copy_only=False)
+                arr = ensure_string_array(arr, na_value=libmissing.NA)
+                results.append(arr)
+
             if len(chunks) == 0:
                 arr = np.array([], dtype=object)
             else:
-                arr = pyarrow.concat_arrays(chunks).to_numpy(zero_copy_only=False)
-                arr = ensure_string_array(arr, na_value=libmissing.NA)
+                arr = np.concatenate(results)
+
             # Bypass validation inside StringArray constructor, see GH#47781
             new_string_array = StringArray.__new__(StringArray)
             NDArrayBacked.__init__(

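The reason for the chunk-wise conversion: pyarrow's string type uses 32-bit offsets, so concatenating chunks whose combined string data exceeds 2 GiB can fail with an offset overflow, which is what the removed pyarrow.concat_arrays call ran into. Below is a minimal standalone sketch of the same idea, converting each chunk to numpy before concatenating; the helper name chunked_to_numpy is hypothetical and not part of pandas.

import numpy as np
import pyarrow as pa

def chunked_to_numpy(chunked: pa.ChunkedArray) -> np.ndarray:
    # Convert each pyarrow chunk to a numpy object array and concatenate the
    # numpy arrays, instead of concatenating the pyarrow arrays themselves,
    # so the 32-bit string offsets never have to cover the combined data.
    results = [chunk.to_numpy(zero_copy_only=False) for chunk in chunked.chunks]
    if not results:
        return np.array([], dtype=object)
    return np.concatenate(results)

arr = pa.chunked_array([["a", "b"], ["c", None]], type=pa.string())
print(chunked_to_numpy(arr))  # ['a' 'b' 'c' None]
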
pandas/tests/io/test_parquet.py (+12)

@@ -1141,6 +1141,18 @@ def test_infer_string_large_string_type(self, tmp_path, pa):
         )
         tm.assert_frame_equal(result, expected)
 
+    # NOTE: this test is not run by default, because it requires a lot of memory (>5GB)
+    # @pytest.mark.slow
+    # def test_string_column_above_2GB(self, tmp_path, pa):
+    #     # https://github.com/pandas-dev/pandas/issues/55606
+    #     # above 2GB of string data
+    #     v1 = b"x" * 100000000
+    #     v2 = b"x" * 147483646
+    #     df = pd.DataFrame({"strings": [v1] * 20 + [v2] + ["x"] * 20}, dtype="string")
+    #     df.to_parquet(tmp_path / "test.parquet")
+    #     result = read_parquet(tmp_path / "test.parquet")
+    #     assert result["strings"].dtype == "string"
+
 
 class TestParquetFastParquet(Base):
     def test_basic(self, fp, df_full):

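The full regression test stays commented out because it needs more than 5 GB of memory. A lighter check, shown as a minimal sketch assuming the default python-backed "string" storage, is to feed a small multi-chunk ChunkedArray straight to StringDtype.__from_arrow__; this exercises the new chunk-by-chunk loop, though not the actual >2 GB overflow condition.

import pandas as pd
import pyarrow as pa

# Two chunks, including a null, routed through the python-storage branch of
# StringDtype.__from_arrow__ that the patch rewrites to convert chunk by chunk.
chunked = pa.chunked_array([["a", "b"], ["c", None]], type=pa.string())
result = pd.StringDtype().__from_arrow__(chunked)
print(result.dtype)  # string
print(list(result))  # ['a', 'b', 'c', <NA>]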