
BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category' #46202


Closed
3 tasks done
whypandasslow opened this issue Mar 2, 2022 · 17 comments · Fixed by #50548
Assignees: rhshadrach
Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, Categorical (categorical data type), Groupby, Performance (memory or execution speed)
Milestone: 2.0

Comments

@whypandasslow

whypandasslow commented Mar 2, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# EXAMPLE 1: col1 and col2 kept as the default 'object' dtype

import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2: same construction, but col2 is cast to 'category'

import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

Issue Description

I have a 2022 MacBook Pro with an M1 Pro, pandas 1.4.1, and numpy 1.22.2.

I noticed a significant performance drop when running

sample_df.groupby("col1")["col2"].value_counts().unstack()

when col2, or both col1 and col2, are of type 'category' instead of the default type 'object'. The operation in EXAMPLE 1 runs in around ~25 seconds on my computer (similar on a 2019 MacBook Pro with an Intel processor). In EXAMPLE 2 I ran the operation for more than 20 minutes and it still did not finish (on the 2019 MacBook Pro with an Intel processor the running time is similar for EXAMPLE 1 and EXAMPLE 2).

Moreover, on the M1 the peak memory usage for EXAMPLE 1 is around 8-10 GB, while for EXAMPLE 2 it well exceeds 30 GB.
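(Not part of the original report: one rough way to check the peak memory on a Unix-like system is the standard-library resource module; note that ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.)

# Rough sketch, not from the original report: print the process's peak RSS
# after running one of the examples above (Unix-only).
import resource
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS:", peak)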

Expected Behavior

On the M1:

EXAMPLE 1 and EXAMPLE 2 should perform roughly the same, as they do on the Intel processor.

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.9.10.final.0
python-bits : 64
OS : Darwin
OS-release : 21.3.0
Version : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.9.3
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
None

@whypandasslow whypandasslow added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 2, 2022
@whypandasslow whypandasslow changed the title BUG: very slow groupby()[col].value_counts() for columns of type 'category' BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category' Mar 2, 2022
@samukweku
Contributor

samukweku commented Mar 2, 2022

@whypandasslow What happens if you set observed=True in the groupby? Since you are dealing with categorical data, this might offer a performance improvement.
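A minimal sketch of that suggestion, reusing the names from the example above:

# Same groupby as in the report, but with observed=True so that only observed
# category combinations are produced.
processed_df = sample_df.groupby("col1", observed=True)["col2"].value_counts().unstack()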

@whypandasslow
Author

Hey @samukweku. I tried it and saw almost no improvement at all. Whether it is turned on or off, the difference in running time between EXAMPLE 1 and EXAMPLE 2 is a factor of hundreds.

@rhshadrach
Member

rhshadrach commented Mar 2, 2022

When value_counts was added to DataFrameGroupBy, I noticed that it seemed to perform better than the SeriesGroupBy implementation. That seems to be the case here when combined with observed=True. Can you verify the performance of:

processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()

@rhshadrach rhshadrach added Groupby Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 2, 2022
@whypandasslow
Author

Hey @rhshadrach! I have checked 4 cases on a reduced sample of 130000 records:
example 1 using series approach - 0.356 seconds
example 1 using dataframe approach - 0.3622 seconds
example 2 using series approach - 2509 seconds
example 2 using dataframe approach - 43 seconds

I see only a slight improvement for the dataframe approach when the column is of type object; however, I see a HUGE improvement when the column is of type category. Yet still, leaving the column as object yields a much better running time (by a factor of ~100x) than when it is converted to category.

So the performance problem for this operation on columns of type 'category' still persists no matter which approach is used.

@rhshadrach
Member

rhshadrach commented Mar 5, 2022

@whypandasslow - I'm not seeing the same. Are you by chance not setting observed=True? I am using the following code:

Code
import numpy as np
import pandas as pd
import time

size = 130000

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
print(time.time()-t0)

And getting the output

16.620705127716064
0.5117876529693604
16.49775242805481
0.5147149562835693

showing no significant difference between SeriesGroupBy with object dtype and DataFrameGroupBy with categorical dtype.

@whypandasslow
Author

@rhshadrach, @samukweku instead of copy-pasting @rhshadrach's line, I had just added another layer of square brackets around "col2" to make it a dataframe. This is how I got my 43 seconds. Using both the dataframe approach and the parameter observed=True, I got results similar to @rhshadrach's. Conclusion: when working with a column of type 'category', one must use both observed=True and the dataframe approach. The default groupby on a series yields horrible results. A groupby on a dataframe improves the performance, but only combined with observed=True does it yield the best performance. What is more, combining observed=True with the series groupby gives no improvement over the series groupby without observed=True. In the end, using both a groupby on a dataframe and observed=True on columns of type 'category', I got better performance than with a groupby on a series on columns of type 'object'. Thank you @rhshadrach and @samukweku! My problem is solved!
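For reference, a short sketch of the four combinations discussed above (names as in the original example, with col2 of type 'category'; timings are machine-dependent):

sample_df.groupby("col1")["col2"].value_counts()                   # SeriesGroupBy: very slow
sample_df.groupby("col1", observed=True)["col2"].value_counts()    # SeriesGroupBy: no real improvement
sample_df.groupby("col1")[["col2"]].value_counts()                 # DataFrameGroupBy: much faster
sample_df.groupby("col1", observed=True)[["col2"]].value_counts()  # DataFrameGroupBy + observed=True: fastest workaround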

@rhshadrach
Member

@whypandasslow - thanks for the response. Using DataFrameGroupBy is merely a workaround. The performance issues in SeriesGroupBy should be fixed.

@rhshadrach rhshadrach reopened this Mar 7, 2022
@rhshadrach rhshadrach removed the Needs Triage Issue that has not been reviewed by a pandas team member label Mar 7, 2022
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Mar 7, 2022
@LucasG0
Contributor

LucasG0 commented Mar 10, 2022

Hi, I am interested in working on this issue. May I take it?

@LucasG0
Contributor

LucasG0 commented Mar 11, 2022

In the output of SeriesGroupBy.value_counts with categoricals, each group has as many rows as there are categories, thus producing many extra rows. I don't think this is expected, and fixing it should probably fix the performance too.

>>> df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
>>> df
	col1	col2
0	A	X
1	B	Y
2	C	Z
>>> df.groupby("col1")["col2"].value_counts()
col1  col2
A     X       1
B     Y       1
C     Z       1
Name: col2, dtype: int64

>>> df['col2'] = df['col2'].astype('category')
>>> df.groupby("col1")["col2"].value_counts()
col1   
A     X    1
      Y    0
      Z    0
B     Y    1
      X    0
      Z    0
C     Z    1
      X    0
      Y    0
Name: col2, dtype: int64

@samukweku
Contributor

@LucasG0 if you pass observed=True to the groupby, it should reduce the number of rows.

@LucasG0
Contributor

LucasG0 commented Mar 12, 2022

observed=True does not reduce the number of rows if used with a SeriesGroupBy, but it does for a DataFrameGroupBy (above workaround).

>>> df2.groupby("col1", observed=True)["col2"].value_counts()
col1   
A     X    1
      Y    0
      Z    0
B     Y    1
      X    0
      Z    0
C     Z    1
      X    0
      Y    0
Name: col2, dtype: int64

>>> df2.groupby("col1")[["col2"]].value_counts()
col1  col2
A     X       1
      Y       0
      Z       0
B     Y       1
      X       0
      Z       0
C     Z       1
      X       0
      Y       0
dtype: int64
>>> df2.groupby("col1", observed=True)[["col2"]].value_counts()
col1  col2
A     X       1
B     Y       1
C     Z       1
dtype: int64

@rhshadrach
Member

@LucasG0 - thanks for finding that SeriesGroupBy.value_counts does not implement observed=True properly. It appears that is a separate bug. Can you see if there is already an issue for this, and create one if there is not.

@LucasG0
Contributor

LucasG0 commented Mar 13, 2022

@rhshadrach It does not seem to have been reported yet. I created the issue and also detailed why I have a doubt whether the observed parameter should have an effect here.

@LucasG0
Contributor

LucasG0 commented Mar 13, 2022

I observed the same performance issue as above on 130k records, and I think the issue here is the gap between SeriesGroupBy.value_counts and DataFrameGroupBy.value_counts on categoricals:

%%time
df2.groupby("col1")[["col2"]].value_counts()
CPU times: total: 38.7 s
Wall time: 39.8 s

%%time
df2.groupby("col1")["col2"].value_counts()
CPU times: total: 15min 27s
Wall time: 15min 28s

Also, as discussed in #46357, the DataFrameGroupBy.value_counts result should not be impacted by the observed parameter in this case. Once that fix is in effect, the above workaround combining the DataFrameGroupBy version + observed=True won't be relevant anymore, and I think the (less efficient) workaround will be to cast back to object dtype for this operation (unless an observed parameter is introduced to the value_counts API).

@rhshadrach
Member

Also, as discussed in #46357, the DataFrameGroupBy.value_counts result should not be impacted by the observed parameter in this case. Once that fix is in effect, the above workaround combining the DataFrameGroupBy version + observed=True won't be relevant anymore, and I think the (less efficient) workaround will be to cast back to object dtype for this operation (unless an observed parameter is introduced to the value_counts API).

While this is partially true, I still think there is a significant performance issue with SeriesGroupBy. I changed the line

observed=self.observed,

to be observed=False. I think this fixes the bug in #46357. Doing

df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
df['col2'] = df['col2'].astype('category')
print(df.groupby("col1")[["col2"]].value_counts())

gives

col1  col2
A     X       1
      Y       0
      Z       0
B     Y       1
      X       0
      Z       0
C     Z       1
      X       0
      Y       0
dtype: int64

Now using a modified example from #46202 (comment)

Code
import numpy as np
import pandas as pd
import time

size = 13000

col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2

col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)["col2"].value_counts().unstack()
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
print(time.time()-t0)

I get the timings

0.047942399978637695 # SeriesGroupBy, object
25.37419080734253    # SeriesGroupBy, category
3.2430622577667236   # DataFrameGroupBy, category

@LucasG0
Contributor

LucasG0 commented Mar 15, 2022

While this is partially true, I still think there is a significant performance issue with SeriesGroupBy

Yes, what I meant is that once both this issue and #46357 are fixed, it won't be possible to retrieve only the observed values in order to get the best performance with the categorical dtype; a prior cast to object dtype will be needed instead.
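A minimal sketch of that cast-to-object workaround, assuming the sample_df from the original example with col2 stored as 'category':

# Cast the categorical column to object just for this operation, then count.
counts = (
    sample_df.assign(col2=sample_df["col2"].astype(object))
    .groupby("col1")["col2"]
    .value_counts()
    .unstack()
)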

@LucasG0
Contributor

LucasG0 commented Mar 15, 2022

take

@jreback jreback modified the milestones: Contributions Welcome, 1.5 May 4, 2022
@LucasG0 LucasG0 removed their assignment Jul 15, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@rhshadrach rhshadrach added the Categorical Categorical Data Type label Jan 4, 2023
@rhshadrach rhshadrach self-assigned this Jan 4, 2023
@rhshadrach rhshadrach added this to the 2.0 milestone Jan 11, 2023