Commit 94a8af5

PERF: Series.str.split(expand=True) for pyarrow-backed strings (pandas-dev#53585)
* PERF: Series.str.split(expand=True) for pyarrow-backed
* gh ref
1 parent 828e743 commit 94a8af5

2 files changed, +8 -2 lines changed


doc/source/whatsnew/v2.1.0.rst (+1)

@@ -316,6 +316,7 @@ Performance improvements
 - Performance improvement in :meth:`Series.add` for pyarrow string and binary dtypes (:issue:`53150`)
 - Performance improvement in :meth:`Series.corr` and :meth:`Series.cov` for extension dtypes (:issue:`52502`)
 - Performance improvement in :meth:`Series.str.get` for pyarrow-backed strings (:issue:`53152`)
+- Performance improvement in :meth:`Series.str.split` with ``expand=True`` for pyarrow-backed strings (:issue:`53585`)
 - Performance improvement in :meth:`Series.to_numpy` when dtype is a numpy float dtype and ``na_value`` is ``np.nan`` (:issue:`52430`)
 - Performance improvement in :meth:`~arrays.ArrowExtensionArray.astype` when converting from a pyarrow timestamp or duration dtype to numpy (:issue:`53326`)
 - Performance improvement in :meth:`~arrays.ArrowExtensionArray.to_numpy` (:issue:`52525`)

pandas/core/strings/accessor.py (+7 -2)

@@ -279,7 +279,7 @@ def _wrap_result(
 
             from pandas.core.arrays.arrow.array import ArrowExtensionArray
 
-            value_lengths = result._pa_array.combine_chunks().value_lengths()
+            value_lengths = pa.compute.list_value_length(result._pa_array)
             max_len = pa.compute.max(value_lengths).as_py()
             min_len = pa.compute.min(value_lengths).as_py()
             if result._hasna:
@@ -313,9 +313,14 @@ def _wrap_result(
                 labels = name
             else:
                 labels = range(max_len)
+            result = (
+                pa.compute.list_flatten(result._pa_array)
+                .to_numpy()
+                .reshape(len(result), max_len)
+            )
             result = {
                 label: ArrowExtensionArray(pa.array(res))
-                for label, res in zip(labels, (zip(*result.tolist())))
+                for label, res in zip(labels, result.T)
             }
         elif is_object_dtype(result):
