BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category' #46202
Comments
@whypandasslow What happens if you set observed=True?
Hey @samukweku. I tried it and saw almost no improvement at all. Whether it is turned on or off, the difference in running time between EXAMPLE 1 and EXAMPLE 2 is a factor of hundreds.
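For reference, a minimal illustration of what observed=True does (the frame below is a made-up example, not the reporter's data): grouping by a categorical key includes unobserved categories by default, and observed=True drops them.
>>> import pandas as pd
>>> df = pd.DataFrame({"col1": ["A", "A", "B"],
...                    "col2": pd.Categorical(["X", "X", "Y"], categories=["X", "Y", "Z"])})
>>> df.groupby("col2").size()
col2
X    2
Y    1
Z    0
dtype: int64
>>> df.groupby("col2", observed=True).size()
col2
X    2
Y    1
dtype: int64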
Hey @rhshadrach! I have checked 4 cases on a reduced sample of 130000 records: I see only a slight improvement for the dataframe approach when the column is of type object, but a HUGE improvement when the column is of type category. Yet still, leaving the column as object yields much better running time (by a factor of 100x) than when it is converted to category. So the performance problem for the operation on columns of type 'category' still persists no matter which approach is used.
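For concreteness, the four combinations under discussion can be timed with a sketch like the following (sample_df, the group/value cardinalities, and the timing harness are assumptions, not the reporter's code):

import time
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reporter's 130,000-record sample.
rng = np.random.default_rng(0)
sample_df = pd.DataFrame({
    "col1": rng.choice([f"g{i}" for i in range(100)], size=130_000),
    "col2": rng.choice([f"v{i}" for i in range(1_000)], size=130_000),
})
cat_df = sample_df.astype({"col2": "category"})

def bench(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.2f}s")

for df, dtype in [(sample_df, "object"), (cat_df, "category")]:
    # SeriesGroupBy path (the slow one for categorical data)
    bench(f"series groupby, {dtype}",
          lambda df=df: df.groupby("col1")["col2"].value_counts())
    # DataFrameGroupBy path with observed=True (the workaround)
    bench(f"frame groupby + observed=True, {dtype}",
          lambda df=df: df.groupby("col1", observed=True)[["col2"]].value_counts())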
@whypandasslow - I'm not seeing the same. Are you by chance missing setting observed=True? Running my timing code and looking at the output, I see no significant difference between SeriesGroupBy with object dtype and DataFrameGroupBy with categorical dtype.
@rhshadrach, @samukweku instead of copy-pasting @rhshadrach's line, I had just added another layer of square brackets around "col2" to make it a dataframe. This is how I got my 43 seconds. Using both the dataframe approach and the parameter observed=True, I got similar results to those of @rhshadrach. Conclusion: when working with a column of type 'category', one must use both observed=True and the dataframe approach. Doing the default groupby on a series yields horrible results. Doing the groupby on a dataframe improves the performance, but only combined with observed=True does it yield the best performance. What is more, combining observed=True with a series groupby gives no improvement over a series groupby without observed=True. In the end, using both groupby on a dataframe and observed=True on columns of type 'category', I got better performance than with groupby on a series on columns of type 'object'. Thank you @rhshadrach and @samukweku! My problem is solved!
@whypandasslow - thanks for the response. Using DataFrameGroupBy is merely a workaround. The performance issues in SeriesGroupBy should be fixed.
Hi, I am interested in working on this issue. May I take it?
In the output below, unobserved categories show up with zero counts once col2 is cast to category:
>>> df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
>>> df
col1 col2
0 A X
1 B Y
2 C Z
>>> df.groupby("col1")["col2"].value_counts()
col1 col2
A X 1
B Y 1
C Z 1
Name: col2, dtype: int64
>>> df['col2'] = df['col2'].astype('category')
>>> df.groupby("col1")["col2"].value_counts()
col1
A X 1
Y 0
Z 0
B Y 1
X 0
Z 0
C Z 1
X 0
Y 0
Name: col2, dtype: int64
@LucasG0 if you pass observed=True, it only takes effect with the DataFrameGroupBy approach:
>>> df2.groupby("col1", observed=True)["col2"].value_counts()
col1
A X 1
Y 0
Z 0
B Y 1
X 0
Z 0
C Z 1
X 0
Y 0
Name: col2, dtype: int64
>>> df2.groupby("col1")[["col2"]].value_counts()
col1 col2
A X 1
Y 0
Z 0
B Y 1
X 0
Z 0
C Z 1
X 0
Y 0
dtype: int64
>>> df2.groupby("col1", observed=True)[["col2"]].value_counts()
col1 col2
A X 1
B Y 1
C Z 1
dtype: int64
@LucasG0 - thanks for finding that SeriesGroupBy.value_counts does not implement observed. Could you open a separate issue for that if it hasn't been reported already?
@rhshadrach It does not seem it has been reported yet. I created the issue and also detailed why I have a doubt about how the observed parameter should behave there.
I observed the same performance issue as above on 130k records, and I think the issue here is the difference between the DataFrameGroupBy and SeriesGroupBy code paths:

%%time
df2.groupby("col1")[["col2"]].value_counts()
CPU times: total: 38.7 s
Wall time: 39.8 s

%%time
df2.groupby("col1")["col2"].value_counts()
CPU times: total: 15min 27s
Wall time: 15min 28s

Also, see the discussion in #46357.
While this is partially true, I still think there is a significant performance issue with SeriesGroupBy. I changed the line at pandas/pandas/core/groupby/generic.py (line 1731 in 48d5159), and, rerunning a modified example from #46202 (comment), I get significantly improved timings.
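For context, in the pandas 1.4-era source the categorical branch of SeriesGroupBy.value_counts falls back to a per-group apply, conceptually equivalent to the snippet below (a sketch of the behavior, not the actual pandas source):
>>> import pandas as pd
>>> df = pd.DataFrame({"col1": ["A", "A", "B"], "col2": pd.Categorical(["X", "Y", "X"])})
>>> # One Python-level Series.value_counts call per group, which is what
>>> # makes the categorical path so much slower than the vectorized one.
>>> df.groupby("col1")["col2"].apply(lambda s: s.value_counts())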
Yes, what I meant is that when both this issue and #46357 are fixed, it won't be possible to retrieve only the observed values in order to get the best performance using categorical dtype, but a prior casting to object dtype would achieve the same result.
take |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
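A minimal sketch consistent with the description below (the row count, cardinalities, and random data are assumptions, not the original code):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000  # assumed size; the report does not state the exact row count

# EXAMPLE 1: both columns of default dtype 'object'
sample_df = pd.DataFrame({
    "col1": rng.choice([f"group_{i}" for i in range(1_000)], size=n),
    "col2": rng.choice([f"value_{i}" for i in range(1_000)], size=n),
})
sample_df.groupby("col1")["col2"].value_counts().unstack()  # ~25s per the report

# EXAMPLE 2: col2 cast to 'category' (or both columns)
cat_df = sample_df.astype({"col2": "category"})
cat_df.groupby("col1")["col2"].value_counts().unstack()  # reported not to finish in 20+ minutes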
Issue Description
I have a 2022 MacBook Pro M1 Pro, pandas 1.4.1, numpy 1.22.2.
I noticed a significant performance drop when trying to perform
sample_df.groupby("col1")["col2"].value_counts().unstack()
when col2, or both col1 and col2, are of type 'category' instead of the default type 'object'. The operation in EXAMPLE 1 runs in around ~25 seconds on my computer (similar on a 2019 MacBook Pro with an Intel processor). In EXAMPLE 2 I ran the operation for more than 20 minutes and it still did not finish (on the 2019 MacBook Pro with the Intel processor the running time is similar for EXAMPLE 1 and EXAMPLE 2).
Moreover, for the M1 in EXAMPLE 1 the peak memory usage is around 8-10GB, while for EXAMPLE 2 it well exceeds 30GB.
Expected Behavior
On M1, EXAMPLE 1 and EXAMPLE 2 should perform roughly the same as they do on an Intel processor.
Installed Versions
INSTALLED VERSIONS
commit : 06d2301
python : 3.9.10.final.0
python-bits : 64
OS : Darwin
OS-release : 21.3.0
Version : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.9.3
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None