PERF: groupby reductions with pyarrow dtypes #52469

Merged: 22 commits merged into pandas-dev:main on Apr 7, 2023

Conversation

jbrockmendel (Member)

Re-running the benchmark in #52070:

%timeit df_new.groupby("s")["v1"].sum()
584 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   # <- main
247 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   # <- PR

%timeit df_old.groupby("s")["v1"].sum()
288 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The _to_masked conversion accounts for about 2/3 of the runtime of the _groupby_op call, so there is still room for improvement (though the .sum() itself is only about 1/3 of the total runtime here).
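
The frames from #52070 are not reproduced in this thread; a minimal sketch of a benchmark with the same shape (the sizes, column names, and dtypes below are assumptions, not the original setup) would be:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 10_000_000

# "old" frame backed by plain NumPy dtypes
df_old = pd.DataFrame({"s": rng.integers(0, 1_000, N), "v1": rng.random(N)})

# "new" frame backed by pyarrow dtypes
df_new = df_old.convert_dtypes(dtype_backend="pyarrow")

# %timeit df_old.groupby("s")["v1"].sum()
# %timeit df_new.groupby("s")["v1"].sum()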

@mroeschke added the Groupby, Performance, and Arrow labels on Apr 7, 2023
@mroeschke added this to the 2.1 milestone on Apr 7, 2023
@mroeschke merged commit c94f9af into pandas-dev:main on Apr 7, 2023
@mroeschke (Member)

Thanks @jbrockmendel

@jbrockmendel deleted the perf-arrow-gb branch on April 7, 2023 at 03:14

mask = self.isna()                                       # boolean mask of missing values
arr = self.to_numpy(dtype=np_dtype, na_value=na_value)   # materialize the values as a plain NumPy array
return arr_cls(arr, mask)                                # wrap values + mask in the corresponding masked array class
A reviewer (Member) commented on the lines above:

See my comment on the issue; I think this can be optimized and simplified by reusing __from_arrow__ (which uses pyarrow_array_to_numpy_and_mask under the hood).
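
As a rough illustration of that suggestion (a sketch assuming a float64 Arrow array; __from_arrow__ is the hook that masked dtypes such as pd.Float64Dtype implement for converting pyarrow data):

import pyarrow as pa
import pandas as pd

pa_arr = pa.array([1.0, None, 3.0], type=pa.float64())

# Float64Dtype.__from_arrow__ builds a masked FloatingArray (values + mask)
# from the pyarrow data, avoiding a to_numpy(na_value=...) conversion
# followed by a separate isna() pass.
masked = pd.Float64Dtype().__from_arrow__(pa_arr)
print(masked)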

@randolf-scholz (Contributor) commented Jul 19, 2023

There is still a huge performance degradation in pandas 2.0.3:

import numpy as np
import pandas as pd

M, N = 10, 10_000
tol = 0.5

y = np.random.rand(N, M)
y[y > tol] = float("nan")

df = pd.DataFrame(y, dtype="float32[pyarrow]")
df.index.name = "time"

%%time
df.convert_dtypes(dtype_backend="numpy_nullable").groupby("time").mean()

finishes in 13.2 ms

%%time
df.groupby("time").mean()

takes 4.42 s (!)

I noticed this when I tried aggregating a time series with duplicate index entries, but it also happens when grouping by a column (see the variant below).
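
For reference, a column-based variant of the same reproduction (using reset_index here is just an assumed way of turning the index into an ordinary column):

df2 = df.reset_index()

%%time
df2.groupby("time").mean()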

@jbrockmendel (Member, Author)

I don't see anything close to 4.42s

In [2]: df2 = df.convert_dtypes(dtype_backend="numpy_nullable")

In [3]: gb = df.groupby("time")

In [4]: gb2 = df2.groupby("time")

In [5]: %timeit gb.mean()
4.49 ms ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit gb2.mean()
1.75 ms ± 61.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@randolf-scholz (Contributor) commented Jul 19, 2023

Strange. This is my environment:


INSTALLED VERSIONS
------------------
commit           : 0f437949513225922d851e9581723d82120684a6
python           : 3.11.3.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.19.0-46-generic
Version          : #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.3
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 68.0.0
pip              : 23.2
Cython           : 0.29.36
pytest           : 7.4.0
hypothesis       : None
sphinx           : 7.0.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.3
html5lib         : None
pymysql          : 1.4.6
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.14.0
pandas_datareader: None
bs4              : 4.12.2
bottleneck       : None
brotli           : None
fastparquet      : 2023.7.0
fsspec           : 2023.6.0
gcsfs            : None
matplotlib       : 3.7.2
numba            : 0.57.1
numexpr          : 2.8.4
odfpy            : None
openpyxl         : 3.1.2
pandas_gbq       : None
pyarrow          : 12.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.11.1
snappy           : None
sqlalchemy       : 1.4.49
tables           : 3.8.0
tabulate         : 0.9.0
xarray           : 2023.7.0
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None

I will try on another machine when I'm home.

@randolf-scholz (Contributor) commented Jul 20, 2023

I could reproduce it on my desktop as well. Both machines run Ubuntu 22.04 (desktop CPU: AMD 3900X, laptop CPU: i7-11800H), and I tried both Python 3.10 and 3.11.

@randolf-scholz (Contributor)

I was able to reproduce it in Google Colab as well: https://colab.research.google.com/drive/1kZVSEmLOWXGLV8uRvrc2ACFXdiqoq0cd?usp=sharing

@randolf-scholz (Contributor)

Should I open a separate issue for this?

@jbrockmendel (Member, Author)

Sure, let's see if anyone else can reproduce it.

@randolf-scholz (Contributor) commented Jul 20, 2023

Opened #54208

Merging this pull request may close: PERF: group by manipulation is slower with new arrow engine