Categorical columns are slow in groupby operations #32976
Comments
Initially, I thought this would be due to unobserved categories, but the issue is present even with …
I'm also seeing this on pandas 1.0.3. As with the OP, the performance difference is astronomical between strings and categories.
Ideally we would pass the categorical …
Also seeing this bug. A faster work-around seems to be to use merge/drop_duplicates:

cats = pd.Categorical(range(10**6))
df = pd.DataFrame(dict(x=cats, y=cats))

%timeit df.astype(str).groupby(['x', 'y']).count()
# 2.84 s ± 62.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.merge(df[['x', 'y']].drop_duplicates().reset_index()).groupby('index').count()
# 845 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
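The merge-based trick presumably wins because drop_duplicates().reset_index() materializes an integer 'index' column, so the final groupby runs on plain integers rather than on the categorical pair.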
Linked PR: Aggregate categorical codes with fast cython aggregation for select `how` operations.
8/1/20: rebase and move release note to 1.2
8/2/20: Update tests to expect categorical back
8/3/20: add PR as issue for whatsnew groupby api change
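The idea in that PR, roughly: run the existing fast integer cython aggregation on the categorical's underlying codes, then rebuild a categorical from the result. A minimal sketch of the concept (function and variable names here are illustrative, not pandas internals):

```python
import pandas as pd

def last_via_codes(df: pd.DataFrame, key: str, cat_col: str) -> pd.Series:
    """Sketch: aggregate a categorical column by grouping its integer codes."""
    codes = df[cat_col].cat.codes            # integer Series; -1 marks missing
    result = codes.groupby(df[key]).last()   # fast cython path on plain integers
    # Rebuild the categorical from the aggregated codes (-1 maps back to NaN).
    values = pd.Categorical.from_codes(result.to_numpy(), dtype=df[cat_col].dtype)
    return pd.Series(values, index=result.index, name=cat_col)
```

Only aggregations whose result is one of the existing values (first/last/min/max) can round-trip through codes like this, which is presumably why the PR limits the fast path to select `how` operations.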
Closed by #52120. The `df_cat.groupby("group").last()` is now slightly faster than the …
Code Sample, a copy-pastable example if possible
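The original snippet was not captured in this scrape. A minimal reproduction consistent with the description below (a GROUPS constant controlling group count, categorical values in the grouped frame, strings as the faster baseline) might look like this; the sizes and column names are assumptions, not the reporter's code:

```python
import numpy as np
import pandas as pd

GROUPS = 100_000  # assumed value; the report says larger values widen the gap
N = 1_000_000

df = pd.DataFrame({
    "group": np.random.randint(0, GROUPS, size=N),
    "value": pd.Categorical(np.random.randint(0, GROUPS, size=N)),
})

# Categorical column in the grouped frame: slow
%timeit df.groupby("group").last()

# Same data with the value column converted to strings: much faster
%timeit df.astype({"value": str}).groupby("group").last()
```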
Problem description
We have a dataframe like the above in production and noticed that a particular groupby call was drastically slowing down our pipeline. We were able to work around the issue by converting categorical types to strings before the groupby operation.
The issue occurs when categorical values are in the grouped frame (not when they're included in the groupby index). The problem scales with the number of groups: the larger the GROUPS value in the snippet above, the worse the timing differences are.
We didn't expect using categories to cause this kind of performance issue. Is this expected, or could it be something worth optimizing?
Output of pd.show_versions()

INSTALLED VERSIONS
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.137+
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.18.2
pytz : 2018.9
dateutil : 2.8.1
pip : 19.3.1
setuptools : 46.0.0
Cython : 0.29.15
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : 0.4.0
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 5.5.0
pandas_datareader: 0.7.4
bs4 : 4.6.3
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.2.6
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 2.5.9
pandas_gbq : 0.11.0
pyarrow : 0.14.1
pytables : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.4.4
xarray : 0.15.0
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : None