Skip to content

Categorical columns are slow in groupby operations #32976

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jeremywhelchel opened this issue Mar 24, 2020 · 6 comments
Closed

Categorical columns are slow in groupby operations #32976

jeremywhelchel opened this issue Mar 24, 2020 · 6 comments
Labels
Categorical Categorical Data Type Groupby Performance Memory or execution speed performance

Comments

@jeremywhelchel
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd
import random

SIZE = 100000
GROUPS = 10000  # The larger, the more extreme the timing differences
CARDINALITY = 10
CAT = pd.CategoricalDtype([i for i in range(CARDINALITY)])
df_int = pd.DataFrame({
    'group': [random.randint(0, GROUPS) for i in range(SIZE)],
    'cat': [random.choice(CAT.categories) for i in range(SIZE)],
})
df_cat = df_int.astype({'cat': CAT})
df_str = df_cat.astype({'cat': 'str'})

%timeit df_int.groupby('group').last()
# 100 loops, best of 3: 5.51 ms per loop
%timeit df_str.groupby('group').last()
# 100 loops, best of 3: 11.4 ms per loop
%timeit df_cat.groupby('group').last()
# 1 loop, best of 3: 4.62 s per loop

Problem description

We have a dataframe like the above in production and noticed a particular groupby call was drastically slowing down our pipeline. We were able to workaround the issue by converting categorical types to strings before the groupby operation.

The issue occurs when categorical values are in the grouped frame (not when they're included in the groupby index). The problem scales with the number of groups. The larger the GROUPS value in the snippet above, the worse the timing differences are.

We didn't expect using categories to cause these kind of performance issues. Is this expected, or could it be something worth optimizing for?

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.137+
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.18.2
pytz : 2018.9
dateutil : 2.8.1
pip : 19.3.1
setuptools : 46.0.0
Cython : 0.29.15
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : 0.4.0
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 5.5.0
pandas_datareader: 0.7.4
bs4 : 4.6.3
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.2.6
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 2.5.9
pandas_gbq : 0.11.0
pyarrow : 0.14.1
pytables : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.4.4
xarray : 0.15.0
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : None

@TomAugspurger
Copy link
Contributor

Initially, I thought this would be due to unobserved categories, but the issue is present even with observed=True. The slowdown seems to be from doing the categorical aggregation in Python (_aggregate_series_pure_python) rather than Cython (_cython_agg_general). I suppose that Categorical cannot be passed to our Cython aggfuncs, which expects a 2d ndarray.

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Categorical Categorical Data Type Groupby labels Mar 24, 2020
@rben01
Copy link

rben01 commented Mar 25, 2020

I'm also seeing this on pandas 1.0.3. As with OP, the performance difference is astronomical between strings and categories.

pd.show_versions():

INSTALLED VERSIONS    
------------------
commit           : None
python           : 3.8.1.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.4.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.3
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.0.0.post20200309
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 7.13.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.15.1
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None

@rtlee9
Copy link
Contributor

rtlee9 commented Apr 20, 2020

BaseGrouper raises a notImplemented exception here for categorical dtypes but not other dtypes. My understanding is this results in a slower, python based aggregation here. Happy to open a PR but not sure of the best approach yet.

@TomAugspurger
Copy link
Contributor

Ideally we would pass the categorical codes down to cython, group by those, and then convert back to a CategoricalIndex when we're done in Cython.

rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 22, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 23, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 23, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 23, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 24, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 25, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 25, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 26, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 26, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 26, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 27, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 27, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 28, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 28, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Apr 30, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue May 9, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue May 26, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
@adbull
Copy link
Contributor

adbull commented May 26, 2020

Also seeing this bug. A faster work-around seems to be to use merge/drop_duplicates:

cats = pd.Categorical(range(10**6))
df = pd.DataFrame(dict(x=cats, y=cats))
%timeit df.astype(str).groupby(['x', 'y']).count()
# 2.84 s ± 62.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.merge(df[['x', 'y']].drop_duplicates().reset_index()).groupby('index').count()
# 845 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

rtlee9 added a commit to rtlee9/pandas that referenced this issue May 29, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.
rtlee9 added a commit to rtlee9/pandas that referenced this issue Aug 2, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.

8/1/20: rebase and move release note to 1.2
rtlee9 added a commit to rtlee9/pandas that referenced this issue Aug 2, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.

8/1/20: rebase and move release note to 1.2
rtlee9 added a commit to rtlee9/pandas that referenced this issue Aug 2, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.

8/1/20: rebase and move release note to 1.2
8/2/20: Update tests to expect categorical back
rtlee9 added a commit to rtlee9/pandas that referenced this issue Aug 4, 2020
Aggregate categorical codes with fast cython aggregation for select
`how` operations.

8/1/20: rebase and move release note to 1.2
8/2/20: Update tests to expect categorical back
8/3/20: add PR as issue for whatsnew groupby api change
@jbrockmendel
Copy link
Member

Closed by #52120. The df_cat.groupby("group").last() is now slightly faster than the df_int... call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants