memory regression in grouping by categorical variables #32918
Comments
@jangorecki Thanks for the report! Didn't look at it in detail yet, but possibly related to #30552? Can you try if adding observed=True helps?
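For context, a minimal sketch of what the observed keyword controls, on a toy frame (the names df, a, b, and v are illustrative, not from this thread):

```python
import pandas as pd

# Toy frame with two categorical columns; category 'r' is declared
# for column 'b' but never actually occurs in the data.
df = pd.DataFrame({
    'a': pd.Categorical(['x', 'x', 'y'], categories=['x', 'y']),
    'b': pd.Categorical(['p', 'q', 'p'], categories=['p', 'q', 'r']),
    'v': [1, 2, 3],
})

# observed=False (the default) materializes the full cartesian product
# of category levels: 2 * 3 = 6 groups, most of them empty.
print(len(df.groupby(['a', 'b'], observed=False).sum()))   # 6

# observed=True keeps only combinations that actually occur: 3 groups.
print(len(df.groupby(['a', 'b'], observed=True).sum()))    # 3
```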
@jorisvandenbossche Thanks for pointing that out. It looks like a duplicate, but I tried the solution provided there and it does not solve the problem, so it must be something else:

```python
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'}, observed=False)
# MemoryError
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'}, observed=True)
# MemoryError
```
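For reference, the group count that observed=False materializes is the product of the category cardinalities of the key columns; a sketch, assuming x from the snippet above with categorical id columns:

```python
import numpy as np

# With observed=False, grouping on several categorical keys produces
# the full cartesian product of their category levels, even for
# combinations absent from the data, so the potential group count is
# the product of the per-column cardinalities.
key_cols = ['id1', 'id2', 'id3', 'id4', 'id5', 'id6']
n_groups = np.prod([float(x[c].cat.categories.size) for c in key_cols])
print(f"{n_groups:.3g} potential groups")
```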
Is there any chance to have it assigned to a milestone, to ensure fixing this problem is on the roadmap?
Pandas is primarily a volunteer effort. We don't have a roadmap outside of large items. Issues are assigned to milestones when a pull request is made. The quickest way to get this fixed would be with a pull request.
And what exactly would this do? We are all volunteers. If a patch is put up, then the volunteers can contribute time to review it.
@jreback Assigning it to a milestone gives it extra attention from volunteers. I have browsed issues assigned to a milestone multiple times just to see if there was anything I could help with. Otherwise an issue is buried among hundreds of others and is likely to be missed.
It's contributions welcome; PRs are welcome.
@jangorecki observed is a keyword for groupby, not for agg, so your calls above were not actually enabling it. Passing it to groupby works:

```python
ans = x.groupby(['id1','id2','id3','id4','id5','id6'], observed=True).agg({'v3':'sum', 'v1':'count'})
```
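A quick sanity check that the keyword took effect, assuming the x frame from this thread: the result should have one row per key combination actually present in the data, not per cartesian-product combination.

```python
# One row per combination that actually occurs in the data:
n_observed = len(x[['id1','id2','id3','id4','id5','id6']].drop_duplicates())
assert len(ans) == n_observed
```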
@TomAugspurger Thank you for spotting that. Yes, it does the job. The query now completes not only for 1e7 rows but for 1e8 rows as well. Benchmark report updated. Thanks again!
Original issue description:

There seems to be a regression when grouping by categorical columns. The one-year-old version 0.24.2 was able to complete the query, while 1.0.3 hits a MemoryError. memory_usage(deep=True) reports the size of the data frame as 524 MB, while my machine has 125 GB of RAM, so memory should not be an issue.

Input: (code block not preserved in this copy of the issue)

Output:
1.0.3: MemoryError
0.24.2: query completes
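The original input block was not preserved above, but a hypothetical minimal reproduction, consistent with the snippets in this thread (six categorical key columns id1..id6 plus numeric v1 and v3; the row count and cardinalities are illustrative), might look like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 10_000_000  # 1e7 rows, as mentioned in the thread
K = 100         # distinct categories per key column; illustrative

# Six categorical grouping keys plus two numeric value columns.
x = pd.DataFrame({
    **{f'id{i}': pd.Categorical(rng.integers(0, K, N).astype(str))
       for i in range(1, 7)},
    'v1': rng.integers(1, 6, N),
    'v3': rng.random(N),
})

# observed=False would materialize up to K**6 = 1e12 groups (the
# cartesian product of all category levels) and exhaust memory;
# observed=True keeps only combinations present in the data.
ans = (x.groupby(['id1','id2','id3','id4','id5','id6'], observed=True)
        .agg({'v3': 'sum', 'v1': 'count'}))
```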