
memory regression in grouping by categorical variables #32918


Closed
jangorecki opened this issue Mar 23, 2020 · 9 comments
Labels: Categorical, Groupby, Performance

Comments

@jangorecki

jangorecki commented Mar 23, 2020

There seems to be a regression when grouping by categorical columns.
The year-old version 0.24.2 was able to complete the query, while 1.0.3 hits a MemoryError.

memory_usage(deep=True) reports the size of the data frame as 524 MB, while my machine has 125 GB, so memory should not be an issue.

Input

import pandas as pd
import numpy as np

def randChar(f, numGrp, N):
    things = [f % x for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)
K = 100
x = pd.DataFrame({
  'id1' : randChar("id%03d", K, N),       # large groups (char)
  'id2' : randChar("id%03d", K, N),       # large groups (char)
  'id3' : randChar("id%010d", N//K, N),   # small groups (char)
  'id4' : np.random.choice(K, N),         # large groups (int)
  'id5' : np.random.choice(K, N),         # large groups (int)
  'id6' : np.random.choice(N//K, N),      # small groups (int)
  'v1' :  np.random.choice(5, N),         # int in range [0,4]
  'v2' :  np.random.choice(5, N),         # int in range [0,4]
  'v3' :  randFloat(100, N)               # numeric, e.g. 23.5749
})
x['id1'] = x['id1'].astype('category')
x['id2'] = x['id2'].astype('category')
x['id3'] = x['id3'].astype('category')

import os
import gc
import timeit

print(pd.__version__)

gc.collect()
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
t = timeit.default_timer() - t_start
print(t)
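
For reference, the 524 MB figure quoted above can be checked once x is built (a minimal sketch; the exact number varies with string lengths and the random draw):

# Total in-memory size of the frame, counting string payloads.
mb = x.memory_usage(deep=True).sum() / 1e6
print(f"{mb:.0f} MB")  # roughly 524 MB for this data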

Output

1.0.3

print(pd.__version__)
#1.0.3 
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/generic.py", line 928, in aggregate
#    result, how = self._aggregate(func, *args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/base.py", line 419, in _aggregate
#    result = _agg(arg, _agg_1dim)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/base.py", line 386, in _agg
#    result[fname] = func(fname, agg_how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/base.py", line 370, in _agg_1dim
#    return colg.aggregate(how)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/generic.py", line 247, in aggregate
#    return getattr(self, func)(*args, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/groupby.py", line 1371, in f
#    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/groupby.py", line 909, in _cython_agg_general
#    return self._wrap_aggregated_output(output)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/generic.py", line 386, in _wrap_aggregated_output
#    return self._reindex_output(result)._convert(datetime=True)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/groupby/groupby.py", line 2483, in _reindex_output
#    levels_list, names=self.grouper.names
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/indexes/multi.py", line 552, in from_product
#    codes = cartesian_product(codes)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/reshape/util.py", line 58, in cartesian_product
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#pandas/core/reshape/util.py", line 58, in <listcomp>
#    for i, x in enumerate(X)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#numpy/core/fromnumeric.py", line 445, in repeat
#    return _wrapfunc(a, 'repeat', repeats, axis=axis)
#  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/
#numpy/core/fromnumeric.py", line 51, in _wrapfunc
#    return getattr(obj, method)(*args, **kwds)
#MemoryError
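
The traceback bottoms out in cartesian_product, which hints at the cause: with categorical groupers and the default observed=False, _reindex_output reindexes the result to the full cross product of all group levels. A back-of-envelope sketch of the implied output size (assuming, plausibly, that nearly all N//K = 1e5 values of id3 and id6 occur):

# Approximate level cardinalities for id1..id6 in the repro above.
levels = [100, 100, 10**5, 100, 100, 10**5]
rows = 1
for n in levels:
    rows *= n
print(rows)  # 10**18 candidate groups -- hopeless even with 125 GB of RAM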

0.24.2

print(pd.__version__)
#0.24.2
gc.collect()
#0
t_start = timeit.default_timer()
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'})
ans.reset_index(inplace=True)
print(ans.shape, flush=True)
#(10000000, 8)
t = timeit.default_timer() - t_start
print(t)
#9.52401043381542
@jorisvandenbossche
Member

@jangorecki Thanks for the report!

Didn't look into it in detail yet, but possibly related to #30552? Can you try whether adding observed=False changes anything?

@jangorecki
Author

jangorecki commented Mar 25, 2020

@jorisvandenbossche Thanks for pointing that out. It looks like a duplicate, but I tried the solution provided there and it does not solve the problem, so it must be something else:

ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'}, observed=False)
#MemoryError
ans = x.groupby(['id1','id2','id3','id4','id5','id6']).agg({'v3':'sum', 'v1':'count'}, observed=True)
#MemoryError

@jangorecki
Author

Is there any chance to have it assigned to a milestone, to ensure fixing this problem is on the roadmap?
This regression is quite annoying because it affects relatively small data.
The data is 0.5 GB and the machine has 125 GB of memory, so this query needs more than 250 times as much memory as the data itself.

@TomAugspurger
Contributor

Pandas is primarily a volunteer effort. We don't have a roadmap outside of large items. Issues are assigned to milestones when a pull request is made. The quickest way to get this fixed would be with a pull request.

@jreback
Contributor

jreback commented May 13, 2020

Is there any chance to have it assigned to a milestone, to ensure fixing this problem is on the roadmap?
This regression is quite annoying because it affects relatively small data.
The data is 0.5 GB and the machine has 125 GB of memory, so this query needs more than 250 times as much memory as the data itself.

And what exactly would this do?

We are all volunteers. If a patch is put up, the volunteers can contribute time to review it.

@jangorecki
Author

jangorecki commented May 13, 2020

@jreback Assigning it to a milestone gives it extra attention from volunteers. I have browsed issues assigned to a milestone multiple times just to see if there is anything I could help with. Otherwise an issue is buried among hundreds of others and is likely to be missed.
Because it is a regression, there are efficient ways, like git bisect, to track down the exact commit that introduced it, making the fix much easier to implement.

@jreback jreback added the Categorical and Groupby labels May 13, 2020
@jreback jreback added this to the Contributions Welcome milestone May 13, 2020
@jreback
Contributor

jreback commented May 13, 2020

@jreback Assigning it to a milestone gives it extra attention from volunteers. I have browsed issues assigned to a milestone multiple times just to see if there is anything I could help with. Otherwise an issue is buried among hundreds of others and is likely to be missed.
Because it is a regression, there are efficient ways, like git bisect, to track down the exact commit that introduced it, making the fix much easier to implement.

It's in the Contributions Welcome milestone. PRs are welcome.

@TomAugspurger
Contributor

@jangorecki your usage of observed in #32918 (comment) wasn't quite right. It should be passed to groupby, not agg. Can you try this?

ans = x.groupby(['id1','id2','id3','id4','id5','id6'], observed=True).agg({'v3':'sum', 'v1':'count'})
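
For anyone landing here later, a toy example of the difference (a minimal sketch, independent of this repro): with categorical group keys, observed=False reindexes the output to the full cartesian product of the category levels, while observed=True keeps only the combinations that actually occur.

import pandas as pd

df = pd.DataFrame({
    'a': pd.Categorical(['x', 'y']),
    'b': pd.Categorical(['u', 'v']),
    'v': [1, 2],
})
# observed=False (the default here): 2 * 2 = 4 rows, including never-observed pairs.
print(df.groupby(['a', 'b']).agg({'v': 'sum'}).shape)                 # (4, 1)
# observed=True: only the 2 combinations present in the data.
print(df.groupby(['a', 'b'], observed=True).agg({'v': 'sum'}).shape)  # (2, 1)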

@jangorecki
Author

@TomAugspurger Thank you for spotting that. Yes, that does the job. The query now completes not only for 1e7 rows but for 1e8 rows as well. Benchmark report updated. Thanks again!
