
memory regression in grouping by categorical variables #32918


Closed
jangorecki opened this issue Mar 23, 2020 · 9 comments
Labels: Categorical, Groupby, Performance

Comments

@jangorecki

jangorecki commented Mar 23, 2020

There seems to be a regression when grouping by categorical columns.
The year-old version 0.24.2 was able to complete the query, while 1.0.3 hits a MemoryError.

memory_usage(deep=True) reports the size of the data frame as 524 MB, while my machine has 125 GB, so memory should not be an issue.

Input

import pandas as pd
import numpy as np

def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)
K = 100
x = pd.DataFrame({
  'id1' : randChar("id%03d", K, N),       # large groups (char)
  'id2' : randChar("id%03d", K, N),       # large groups (char)
  'id3' : randChar("id%010d", N//K, N),   # small groups (char)
  'id4' : np.random.choice(K, N),         # large groups (int)
  'id5' : np.random.choice(K, N),         # large groups (int)
  'id6' : np.random.choice(N//K, N),      # small groups (int)
  'v1' :  np.random.choice(5, N),         # int in range [0,4]
  'v2' :  np.random.choice(5, N),         # int in range [0,4]
  'v3' :  randFloat(100, N)               # numeric, e.g. 23.5749
})
x['id1'] = x['id1'].astype('category')
x['id2'] = x['id2'].astype('category')
x['id3'] = x['id3'].astype('category')

import os
import gc
import timeit

print(pd.__version__)

gc.collect()
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
print(t)
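
For reference, the 524 MB figure quoted above can be checked once x is built (a minimal sketch; the exact number varies with string lengths and the random draw):

# Total in-memory size of the frame, counting string payloads.
mb = x.memory_usage(deep=True).sum() / 1e6
print(f"{mb:.0f} MB")  # roughly 524 MB for this data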

Output

1.0.3

print(pd.__version__)
#1.0.3 
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/generic.py", line 928, in aggregate
#    result, how = self._aggregate(func, *args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/base.py", line 419, in _aggregate
#    result = _agg(arg, _agg_1dim)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/base.py", line 386, in _agg
#    result[fname] = func(fname, agg_how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/base.py", line 370, in _agg_1dim
#    return colg.aggregate(how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/generic.py", line 247, in aggregate
#    return getattr(self, func)(*args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/groupby.py", line 1371, in f
#    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/groupby.py", line 909, in _cython_agg_general
#    return self._wrap_aggregated_output(output)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/generic.py", line 386, in _wrap_aggregated_output
#    return self._reindex_output(result)._convert(datetime=True)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/groupby.py", line 2483, in _reindex_output
#    levels_list, names=self.grouper.names
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/indexes/multi.py", line 552, in from_product
#    codes = cartesian_product(codes)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/reshape/util.py", line 58, in cartesian_product
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/reshape/util.py", line 58, in <listcomp>
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#numpy/core/fromnumeric.py", line 445, in repeat
#    return _wrapfunc(a, 'repeat', repeats, axis=axis)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#numpy/core/fromnumeric.py", line 51, in _wrapfunc
#    return getattr(obj, method)(*args, **kwds)
#MemoryError
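
The traceback bottoms out in cartesian_product, which hints at the cause: with categorical groupers and the default observed=False, _reindex_output reindexes the result to the full cross product of all group levels. A back-of-envelope sketch of the implied output size (assuming, plausibly, that nearly all N//K = 1e5 values of id3 and id6 occur):

# Approximate level cardinalities for id1..id6 in the repro above.
levels = [100, 100, 10**5, 100, 100, 10**5]
rows = 1
for n in levels:
    rows *= n
print(rows)  # 10**18 candidate groups -- hopeless even with 125 GB of RAM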

0.24.2

print(pd.__version__)
#0.24.2
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
#(10000000, 8)
t = timeit.default_timer() - t_start
print(t)
#9.52401043381542
@jorisvandenbossche
Member

@jangorecki Thanks for the report!

Didn't look into it in detail yet, but possibly related to #30552? Can you try whether adding observed=False changes anything?

@jangorecki
Author

jangorecki commented Mar 25, 2020

@jorisvandenbossche Thanks for pointing that out. It looks like a duplicate, but I tried the solution provided there and it does not solve the problem, so it must be something else:

ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'}, observed=False)
#MemoryError
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'}, observed=True)
#MemoryError

@jangorecki
Author

Is there any chance to have it assigned to a milestone, to ensure fixing this problem is on the roadmap?
This regression is quite annoying because it affects relatively small data.
The data is 0.5 GB and the machine has 125 GB of memory, so this query needs more than 250 times as much memory as the data itself.

@TomAugspurger
Contributor

Pandas is primarily a volunteer effort. We don't have a roadmap outside of large items. Issues are assigned to milestones when a pull request is made. The quickest way to get this fixed would be with a pull request.

@jreback
Contributor

jreback commented May 13, 2020

Is there any chance to have it assigned to a milestone, to ensure fixing this problem is on the roadmap?
This regression is quite annoying because it affects relatively small data.
The data is 0.5 GB and the machine has 125 GB of memory, so this query needs more than 250 times as much memory as the data itself.

And what exactly would this do?

We are all volunteers. If a patch is put up, the volunteers can contribute time to review it.

@jangorecki
Author

jangorecki commented May 13, 2020

@jreback Assigning it to a milestone gives it extra attention from volunteers. I have browsed issues assigned to a milestone multiple times just to see if there is anything I could help with. Otherwise an issue is buried among hundreds of others and is likely to be missed.
Because it is a regression, there are efficient ways, like git bisect, to track down the exact commit that introduced it, making the fix much easier to implement.

@jreback jreback added the Categorical and Groupby labels May 13, 2020
@jreback jreback added this to the Contributions Welcome milestone May 13, 2020
@jreback
Contributor

jreback commented May 13, 2020

@jreback Assigning it to a milestone gives it extra attention from volunteers. I have browsed issues assigned to a milestone multiple times just to see if there is anything I could help with. Otherwise an issue is buried among hundreds of others and is likely to be missed.
Because it is a regression, there are efficient ways, like git bisect, to track down the exact commit that introduced it, making the fix much easier to implement.

It's in the Contributions Welcome milestone. PRs are welcome.

@TomAugspurger
Contributor

@jangorecki your usage of observed in #32918 (comment) wasn't quite right. It should be passed to groupby, not agg. Can you try this?

ans = x.groupby(['id1','id2','id3','id4','id5','id6'], observed=True).agg({'v3':'sum', 'v1':'count'})
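
For anyone landing here later, a toy example of the difference (a minimal sketch, independent of this repro): with categorical group keys, observed=False reindexes the output to the full cartesian product of the category levels, while observed=True keeps only the combinations that actually occur.

import pandas as pd

df = pd.DataFrame({
    'a': pd.Categorical(['x', 'y']),
    'b': pd.Categorical(['u', 'v']),
    'v': [1, 2],
})
# observed=False (the default here): 2 * 2 = 4 rows, including never-observed pairs.
print(df.groupby(['a', 'b']).agg({'v': 'sum'}).shape)                 # (4, 1)
# observed=True: only the 2 combinations present in the data.
print(df.groupby(['a', 'b'], observed=True).agg({'v': 'sum'}).shape)  # (2, 1)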

@jangorecki
Author

@TomAugspurger Thank you for spotting that. Yes, that does the job. The query now completes not only for 1e7 rows but for 1e8 rows as well. Benchmark report updated. Thanks again!
