Commit edbdf95

PERF: Series.str.get_dummies for "string[pyarrow]" and "string[pyarrow_numpy]" dtypes (#56110)
* PERF: Series.str.get_dummies
* whatsnew
* fix
1 parent fbe024f commit edbdf95

3 files changed, +10 -1 lines changed

asv_bench/benchmarks/strings.py

+2 -1
@@ -245,7 +245,8 @@ def time_extract_single_group(self, dtype, expand):
 class Dummies(Dtypes):
     def setup(self, dtype):
         super().setup(dtype)
-        self.s = self.s.str.join("|")
+        N = len(self.s) // 5
+        self.s = self.s[:N].str.join("|")

     def time_get_dummies(self, dtype):
         self.s.str.get_dummies("|")
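
For readers unfamiliar with the method being benchmarked, the following is a minimal pure-Python sketch of the indicator-matrix semantics of `get_dummies("|")`: split each string on the separator, collect the sorted distinct labels, and emit one 0/1 row per input. The helper name `get_dummies_reference` is hypothetical and this is not the pandas implementation; it only illustrates the expected output shape. The last lines mimic the benchmark's change of keeping only the first fifth of the input.

```python
def get_dummies_reference(values, sep="|"):
    """Hypothetical reference helper: returns (matrix, labels).

    Assumes missing values are given as None and produce an all-zero row,
    and that empty tokens are not treated as labels.
    """
    # One set of tokens per input row.
    split_rows = [
        set(v.split(sep)) - {""} if v is not None else set()
        for v in values
    ]
    # Distinct labels across all rows, in sorted order.
    labels = sorted(set().union(*split_rows))
    # One 0/1 indicator row per input value.
    matrix = [[int(label in row) for label in labels] for row in split_rows]
    return matrix, labels


values = ["a|b", "b|c", "a", None, "c|a"]
matrix, labels = get_dummies_reference(values)

# Mirror of the benchmark change: keep only the first fifth of the input.
N = len(values) // 5
small_matrix, small_labels = get_dummies_reference(values[:N])
```

The output has one row per input and one column per distinct label, so the cost of the operation depends on both the number of rows and the number of labels.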

doc/source/whatsnew/v2.2.0.rst

+1
@@ -328,6 +328,7 @@ Performance improvements
 - Performance improvement in :meth:`Index.difference` (:issue:`55108`)
 - Performance improvement in :meth:`MultiIndex.get_indexer` when ``method`` is not ``None`` (:issue:`55839`)
 - Performance improvement in :meth:`Series.duplicated` for pyarrow dtypes (:issue:`55255`)
+- Performance improvement in :meth:`Series.str.get_dummies` when dtype is ``"string[pyarrow]"`` or ``"string[pyarrow_numpy]"`` (:issue:`56110`)
 - Performance improvement in :meth:`Series.str` methods (:issue:`55736`)
 - Performance improvement in :meth:`Series.value_counts` and :meth:`Series.mode` for masked dtypes (:issue:`54984`, :issue:`55340`)
 - Performance improvement in :meth:`SeriesGroupBy.idxmax`, :meth:`SeriesGroupBy.idxmin`, :meth:`DataFrameGroupBy.idxmax`, :meth:`DataFrameGroupBy.idxmin` (:issue:`54234`)

pandas/core/arrays/string_arrow.py

+7
@@ -527,6 +527,13 @@ def _str_find(self, sub: str, start: int = 0, end: int | None = None):
             return super()._str_find(sub, start, end)
         return self._convert_int_dtype(result)

+    def _str_get_dummies(self, sep: str = "|"):
+        dummies_pa, labels = ArrowExtensionArray(self._pa_array)._str_get_dummies(sep)
+        if len(labels) == 0:
+            return np.empty(shape=(0, 0), dtype=np.int64), labels
+        dummies = np.vstack(dummies_pa.to_numpy())
+        return dummies.astype(np.int64, copy=False), labels
+
     def _convert_int_dtype(self, result):
         return Int64Dtype().__from_arrow__(result)
