PERF: agg is an order of magnitude slower with pyarrow dtypes #54065
Comments
I don't think groupby has been implemented for pyarrow objects yet.
Thanks for the report; for the two timings in the OP, on main I get
Closing as a duplicate of #52070.
@rhshadrach Thanks for that, I see it'll be in the However, it's still significantly slower, whereas one should expect much faster vectorized operations when dealing with 16-bit data types. That's one of the benefits of the conversion. #52070 says
We still observe a 20-30% slowdown in test cases, so I'm not sure that this counts as resolved.
Fair point; reopening. Further investigations on the perf difference are welcome!
Is the idea to avoid the conversion to numpy and to do the entire groupby operation pyarrow side?
@rhshadrach Thanks, I'll see what I can do. @lithomas1 Yes, it seems switching between different backends is a little bit cumbersome.
@rhshadrach could you kindly point me in the right direction (which section of the code should I look at) to work on this?
@samukweku - I recommend starting by profiling the code for NumPy vs pyarrow dtypes and seeing where significant differences pop out.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Conversion to pyarrow datatypes changes the performance drastically. I did a bit of profiling, and it looks like agg is to blame here. With the recent introduction of PEP 668, testing the code on the latest branch is cumbersome, so I didn't. There are also potentially relevant issues, but they are not the same: #50121, #46505
Installed Versions
INSTALLED VERSIONS
commit : 965ceca
python : 3.11.3.final.0
python-bits : 64
OS : Linux
OS-release : 6.4.2-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Thu, 06 Jul 2023 18:35:54 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : en_GB.UTF-8
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.0.2
numpy : 1.25.0
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.1.2
Cython : 0.29.36
pytest : 7.4.0
hypothesis : 6.75.3
sphinx : 7.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : 2023.1.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Prior Performance
No response