PERF: groupby reductions with pyarrow dtypes #52469

Merged: 22 commits merged into pandas-dev:main on Apr 7, 2023

Conversation

jbrockmendel (Member)

Re-running the benchmark in #52070:

%timeit df_new.groupby("s")["v1"].sum()
584 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   # <- main
247 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   # <- PR

%timeit df_old.groupby("s")["v1"].sum()
288 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The _to_masked conversion accounts for about 2/3 of the runtime of the _groupby_op call, so there is still room for improvement (though the .sum() itself is only about 1/3 of the total runtime here).
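
The frames from #52070 are not reproduced in this thread; a minimal sketch of a benchmark with the same shape (the sizes, column names, and dtypes below are assumptions, not the original setup) would be:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 10_000_000

# "old" frame backed by plain NumPy dtypes
df_old = pd.DataFrame({"s": rng.integers(0, 1_000, N), "v1": rng.random(N)})

# "new" frame backed by pyarrow dtypes
df_new = df_old.convert_dtypes(dtype_backend="pyarrow")

# %timeit df_old.groupby("s")["v1"].sum()
# %timeit df_new.groupby("s")["v1"].sum()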

@mroeschke added the Groupby, Performance, and Arrow labels on Apr 7, 2023
@mroeschke added this to the 2.1 milestone on Apr 7, 2023
@mroeschke merged commit c94f9af into pandas-dev:main on Apr 7, 2023
@mroeschke (Member)

Thanks @jbrockmendel

@jbrockmendel deleted the perf-arrow-gb branch on April 7, 2023 at 03:14

mask = self.isna()                                       # boolean mask of missing values
arr = self.to_numpy(dtype=np_dtype, na_value=na_value)   # materialize the values as a plain NumPy array
return arr_cls(arr, mask)                                # wrap values + mask in the corresponding masked array class
A reviewer (Member) commented on the lines above:

See my comment on the issue; I think this can be optimized and simplified by reusing __from_arrow__ (which uses pyarrow_array_to_numpy_and_mask under the hood).
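
As a rough illustration of that suggestion (a sketch assuming a float64 Arrow array; __from_arrow__ is the hook that masked dtypes such as pd.Float64Dtype implement for converting pyarrow data):

import pyarrow as pa
import pandas as pd

pa_arr = pa.array([1.0, None, 3.0], type=pa.float64())

# Float64Dtype.__from_arrow__ builds a masked FloatingArray (values + mask)
# from the pyarrow data, avoiding a to_numpy(na_value=...) conversion
# followed by a separate isna() pass.
masked = pd.Float64Dtype().__from_arrow__(pa_arr)
print(masked)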

@randolf-scholz (Contributor) commented Jul 19, 2023

There is still a huge performance degradation in pandas 2.0.3:

import numpy as np
import pandas as pd

M, N = 10, 10_000
tol = 0.5

y = np.random.rand(N, M)
y[y > tol] = float("nan")

df = pd.DataFrame(y, dtype="float32[pyarrow]")
df.index.name = "time"

%%time
df.convert_dtypes(dtype_backend="numpy_nullable").groupby("time").mean()

finishes in 13.2 ms

%%time
df.groupby("time").mean()

takes 4.42 s (!)

I noticed this when I tried aggregating a time series with duplicate index entries, but it also happens when grouping by a column (see the variant below).
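
For reference, a column-based variant of the same reproduction (using reset_index here is just an assumed way of turning the index into an ordinary column):

df2 = df.reset_index()

%%time
df2.groupby("time").mean()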

@jbrockmendel (Member, Author)

I don't see anything close to 4.42s

In [2]: df2 = df.convert_dtypes(dtype_backend="numpy_nullable")

In [3]: gb = df.groupby("time")

In [4]: gb2 = df2.groupby("time")

In [5]: %timeit gb.mean()
4.49 ms ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit gb2.mean()
1.75 ms ± 61.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@randolf-scholz (Contributor) commented Jul 19, 2023

Strange. This is my environment:


INSTALLED VERSIONS
------------------
commit           : 0f437949513225922d851e9581723d82120684a6
python           : 3.11.3.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.19.0-46-generic
Version          : #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.3
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 68.0.0
pip              : 23.2
Cython           : 0.29.36
pytest           : 7.4.0
hypothesis       : None
sphinx           : 7.0.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.3
html5lib         : None
pymysql          : 1.4.6
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.14.0
pandas_datareader: None
bs4              : 4.12.2
bottleneck       : None
brotli           : None
fastparquet      : 2023.7.0
fsspec           : 2023.6.0
gcsfs            : None
matplotlib       : 3.7.2
numba            : 0.57.1
numexpr          : 2.8.4
odfpy            : None
openpyxl         : 3.1.2
pandas_gbq       : None
pyarrow          : 12.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.11.1
snappy           : None
sqlalchemy       : 1.4.49
tables           : 3.8.0
tabulate         : 0.9.0
xarray           : 2023.7.0
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None

I will try on another machine when I'm home.

@randolf-scholz (Contributor) commented Jul 20, 2023

I could reproduce it on my desktop as well. Both machines run Ubuntu 22.04 (desktop CPU: AMD 3900X, laptop CPU: i7-11800H), and I tried both Python 3.10 and 3.11.

@randolf-scholz (Contributor)

I was able to reproduce it in Google Colab as well: https://colab.research.google.com/drive/1kZVSEmLOWXGLV8uRvrc2ACFXdiqoq0cd?usp=sharing

@randolf-scholz (Contributor)

Should I open a separate issue for this?

@jbrockmendel (Member, Author)

Sure, let's see if anyone else can reproduce it.

@randolf-scholz (Contributor) commented Jul 20, 2023

Opened #54208

Merging this pull request may close: PERF: group by manipulation is slower with new arrow engine