
BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category' #46202


Closed
3 tasks done
whypandasslow opened this issue Mar 2, 2022 · 17 comments · Fixed by #50548
Assignees: rhshadrach
Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, Categorical (categorical data type), Groupby, Performance (memory or execution speed)
Milestone: 2.0

Comments

@whypandasslow

whypandasslow commented Mar 2, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# EXAMPLE 1: col1 and col2 kept as the default 'object' dtype

import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2: same construction, but col2 is cast to 'category'

import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

Issue Description

I have a 2022 MacBook Pro with an M1 Pro, pandas 1.4.1, and numpy 1.22.2.

I noticed a significant performance drop when running

sample_df.groupby("col1")["col2"].value_counts().unstack()

when col2, or both col1 and col2, are of type 'category' instead of the default type 'object'. The operation in EXAMPLE 1 runs in around ~25 seconds on my computer (similar on a 2019 MacBook Pro with an Intel processor). In EXAMPLE 2 I ran the operation for more than 20 minutes and it still did not finish (on the 2019 MacBook Pro with an Intel processor the running time is similar for EXAMPLE 1 and EXAMPLE 2).

Moreover, on the M1 the peak memory usage for EXAMPLE 1 is around 8-10 GB, while for EXAMPLE 2 it well exceeds 30 GB.
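(Not part of the original report: one rough way to check the peak memory on a Unix-like system is the standard-library resource module; note that ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.)

# Rough sketch, not from the original report: print the process's peak RSS
# after running one of the examples above (Unix-only).
import resource
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS:", peak)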

Expected Behavior

On the M1:

EXAMPLE 1 and EXAMPLE 2 should perform roughly the same, as they do on the Intel processor.

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.9.10.final.0
python-bits : 64
OS : Darwin
OS-release : 21.3.0
Version : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.9.3
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
None

@whypandasslow whypandasslow added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 2, 2022
@whypandasslow whypandasslow changed the title BUG: very slow groupby()[col].value_counts() for columns of type 'category' BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category' Mar 2, 2022
@samukweku
Contributor

samukweku commented Mar 2, 2022

@whypandasslow What happens if you set observed=True in the groupby? Since you are dealing with categorical data, this might offer a performance improvement.
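A minimal sketch of that suggestion, reusing the names from the example above:

# Same groupby as in the report, but with observed=True so that only observed
# category combinations are produced.
processed_df = sample_df.groupby("col1", observed=True)["col2"].value_counts().unstack()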

@whypandasslow
Author

Hey @samukweku. I tried it and saw almost no improvement at all. Whether it is turned on or off, the difference in running time between EXAMPLE 1 and EXAMPLE 2 is a factor of hundreds.

@rhshadrach
Member

rhshadrach commented Mar 2, 2022

When value_counts was added to DataFrameGroupBy, I noticed that it seemed to perform better than the SeriesGroupBy implementation. That seems to be the case here when combined with observed=True. Can you verify the performance of:

processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()

@rhshadrach rhshadrach added Groupby Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 2, 2022
@whypandasslow
Author

Hey @rhshadrach! I have checked 4 cases on a reduced sample of 130000 records:
example 1 using series approach - 0.356 seconds
example 1 using dataframe approach - 0.3622 seconds
example 2 using series approach - 2509 seconds
example 2 using dataframe approach - 43 seconds

I see only a slight improvement for the dataframe approach when the column is of type object; however, I see a HUGE improvement when the column is of type category. Yet still, leaving the column as object yields a much better running time (by a factor of ~100x) than when it is converted to category.

So the performance problem for this operation on columns of type 'category' still persists no matter which approach is used.

@rhshadrach
Member

rhshadrach commented Mar 5, 2022

@whypandasslow - I'm not seeing the same. Are you by chance not setting observed=True? I am using the following code:

Code
import numpy as np
import pandas as pd
import time

size = 130000

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
print(time.time()-t0)

And getting the output

16.620705127716064
0.5117876529693604
16.49775242805481
0.5147149562835693

showing no significant difference between SeriesGroupBy with object dtype and DataFrameGroupBy with categorical dtype.

@whypandasslow
Author

@rhshadrach, @samukweku instead of copy-pasting @rhshadrach's line, I had just added another layer of square brackets around "col2" to make it a dataframe. This is how I got my 43 seconds. Using both the dataframe approach and the parameter observed=True, I got results similar to @rhshadrach's. Conclusion: when working with a column of type 'category', one must use both observed=True and the dataframe approach. The default groupby on a series yields horrible results. A groupby on a dataframe improves the performance, but only combined with observed=True does it yield the best performance. What is more, combining observed=True with the series groupby gives no improvement over the series groupby without observed=True. In the end, using both a groupby on a dataframe and observed=True on columns of type 'category', I got better performance than with a groupby on a series on columns of type 'object'. Thank you @rhshadrach and @samukweku! My problem is solved!
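For reference, a short sketch of the four combinations discussed above (names as in the original example, with col2 of type 'category'; timings are machine-dependent):

sample_df.groupby("col1")["col2"].value_counts()                   # SeriesGroupBy: very slow
sample_df.groupby("col1", observed=True)["col2"].value_counts()    # SeriesGroupBy: no real improvement
sample_df.groupby("col1")[["col2"]].value_counts()                 # DataFrameGroupBy: much faster
sample_df.groupby("col1", observed=True)[["col2"]].value_counts()  # DataFrameGroupBy + observed=True: fastest workaround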

@rhshadrach
Member

@whypandasslow - thanks for the response. Using DataFrameGroupBy is merely a workaround. The performance issues in SeriesGroupBy should be fixed.

@rhshadrach rhshadrach reopened this Mar 7, 2022
@rhshadrach rhshadrach removed the Needs Triage Issue that has not been reviewed by a pandas team member label Mar 7, 2022
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Mar 7, 2022
@LucasG0
Contributor

LucasG0 commented Mar 10, 2022

Hi, I am interested in working on this issue. May I take it?

@LucasG0
Contributor

LucasG0 commented Mar 11, 2022

In the output of SeriesGroupBy.value_counts with categoricals, each group has as many rows as there are categories, thus producing many extra rows. I don't think this is expected, and fixing it should probably fix the performance too.

>>> df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
>>> df
	col1	col2
0	A	X
1	B	Y
2	C	Z
>>> df.groupby("col1")["col2"].value_counts()
col1  col2
A     X       1
B     Y       1
C     Z       1
Name: col2, dtype: int64

>>> df['col2'] = df['col2'].astype('category')
>>> df.groupby("col1")["col2"].value_counts()
col1   
A     X    1
      Y    0
      Z    0
B     Y    1
      X    0
      Z    0
C     Z    1
      X    0
      Y    0
Name: col2, dtype: int64

@samukweku
Contributor

@LucasG0 if you pass observed=True to the groupby, it should reduce the number of rows.

@LucasG0
Contributor

LucasG0 commented Mar 12, 2022

observed=True does not reduce the number of rows if used with a SeriesGroupBy, but it does for a DataFrameGroupBy (above workaround).

>>> df2.groupby("col1", observed=True)["col2"].value_counts()
col1   
A     X    1
      Y    0
      Z    0
B     Y    1
      X    0
      Z    0
C     Z    1
      X    0
      Y    0
Name: col2, dtype: int64

>>> df2.groupby("col1")[["col2"]].value_counts()
col1  col2
A     X       1
      Y       0
      Z       0
B     Y       1
      X       0
      Z       0
C     Z       1
      X       0
      Y       0
dtype: int64
>>> df2.groupby("col1", observed=True)[["col2"]].value_counts()
col1  col2
A     X       1
B     Y       1
C     Z       1
dtype: int64

@rhshadrach
Member

@LucasG0 - thanks for finding that SeriesGroupBy.value_counts does not implement observed=True properly. It appears that is a separate bug. Can you see if there is already an issue for this, and create one if there is not.

@LucasG0
Contributor

LucasG0 commented Mar 13, 2022

@rhshadrach It does not seem to have been reported yet. I created the issue and also detailed why I have a doubt whether the observed parameter should have an effect here.

@LucasG0
Contributor

LucasG0 commented Mar 13, 2022

I observed the same performance issue as above on 130k records, and I think the issue here is the gap between SeriesGroupBy.value_counts and DataFrameGroupBy.value_counts on categoricals:

%%time
df2.groupby("col1")[["col2"]].value_counts()
CPU times: total: 38.7 s
Wall time: 39.8 s

%%time
df2.groupby("col1")["col2"].value_counts()
CPU times: total: 15min 27s
Wall time: 15min 28s

Also, as discussed in #46357, the DataFrameGroupBy.value_counts result should not be impacted by the observed parameter in this case. Once that fix is in effect, the above workaround combining the DataFrameGroupBy version + observed=True won't be relevant anymore, and I think the (less efficient) workaround will be to cast back to object dtype for this operation (unless an observed parameter is introduced to the value_counts API).

@rhshadrach
Member

Also, as discussed in #46357, the DataFrameGroupBy.value_counts result should not be impacted by the observed parameter in this case. Once that fix is in effect, the above workaround combining the DataFrameGroupBy version + observed=True won't be relevant anymore, and I think the (less efficient) workaround will be to cast back to object dtype for this operation (unless an observed parameter is introduced to the value_counts API).

While this is partially true, I still think there is a significant performance issue with SeriesGroupBy. I changed the line

observed=self.observed,

to be observed=False. I think this fixes the bug in #46357. Doing

df = pd.DataFrame(zip(["A", "B", "C"], ["X", "Y", "Z"]), columns=["col1", "col2"])
df['col2'] = df['col2'].astype('category')
print(df.groupby("col1")[["col2"]].value_counts())

gives

col1  col2
A     X       1
      Y       0
      Z       0
B     Y       1
      X       0
      Z       0
C     Z       1
      X       0
      Y       0
dtype: int64

Now using a modified example from #46202 (comment)

Code
import numpy as np
import pandas as pd
import time

size = 13000

col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

# EXAMPLE 2

col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=size, replace=True)
col2_values = np.random.choice(col2_possible_values, size=size, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)["col2"].value_counts().unstack()
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1", observed=True)[["col2"]].value_counts().unstack()
print(time.time()-t0)

I get the timings

0.047942399978637695 # SeriesGroupBy, object
25.37419080734253    # SeriesGroupBy, category
3.2430622577667236   # DataFrameGroupBy, category

@LucasG0
Contributor

LucasG0 commented Mar 15, 2022

While this is partially true, I still think there is a significant performance issue with SeriesGroupBy

Yes, what I meant is that once both this issue and #46357 are fixed, it won't be possible to retrieve only the observed values in order to get the best performance with the categorical dtype; a prior cast to object dtype will be needed instead.
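A minimal sketch of that cast-to-object workaround, assuming the sample_df from the original example with col2 stored as 'category':

# Cast the categorical column to object just for this operation, then count.
counts = (
    sample_df.assign(col2=sample_df["col2"].astype(object))
    .groupby("col1")["col2"]
    .value_counts()
    .unstack()
)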

@LucasG0
Contributor

LucasG0 commented Mar 15, 2022

take

@jreback jreback modified the milestones: Contributions Welcome, 1.5 May 4, 2022
@LucasG0 LucasG0 removed their assignment Jul 15, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@rhshadrach rhshadrach added the Categorical Categorical Data Type label Jan 4, 2023
@rhshadrach rhshadrach self-assigned this Jan 4, 2023
@rhshadrach rhshadrach added this to the 2.0 milestone Jan 11, 2023