Commit cc58350

PERF: groupby aggregations on pyarrow timestamp and duration types (#55131)
* PERF: groupby aggregations on pyarrow timestamp and duration types
* mypy
* update
1 parent 7134f2c commit cc58350

2 files changed, +11 -2 lines changed

doc/source/whatsnew/v2.2.0.rst

+1
@@ -177,6 +177,7 @@ Performance improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~
 - Performance improvement in :func:`concat` with ``axis=1`` and objects with unaligned indexes (:issue:`55084`)
 - Performance improvement in :func:`to_dict` on converting DataFrame to dictionary (:issue:`50990`)
+- Performance improvement in :meth:`DataFrame.groupby` when aggregating pyarrow timestamp and duration dtypes (:issue:`55031`)
 - Performance improvement in :meth:`DataFrame.sort_index` and :meth:`Series.sort_index` when indexed by a :class:`MultiIndex` (:issue:`54835`)
 - Performance improvement in :meth:`Index.difference` (:issue:`55108`)
 - Performance improvement when indexing with more than 4 keys (:issue:`54550`)

pandas/core/arrays/arrow/array.py

+10 -2
@@ -1993,9 +1993,17 @@ def _groupby_op(
                 **kwargs,
             )
 
-        masked = self._to_masked()
+        # maybe convert to a compatible dtype optimized for groupby
+        values: ExtensionArray
+        pa_type = self._pa_array.type
+        if pa.types.is_timestamp(pa_type):
+            values = self._to_datetimearray()
+        elif pa.types.is_duration(pa_type):
+            values = self._to_timedeltaarray()
+        else:
+            values = self._to_masked()
 
-        result = values._groupby_op(
+        result = values._groupby_op(
             how=how,
             has_dropped_na=has_dropped_na,
             min_count=min_count,
