Skip to content

Major Performance regression of df.groupby(..).indices #38495

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bordingj opened this issue Dec 15, 2020 · 8 comments · Fixed by #38892
Closed

Major Performance regression of df.groupby(..).indices #38495

bordingj opened this issue Dec 15, 2020 · 8 comments · Fixed by #38892
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@bordingj
Copy link

I'm experiencing major performance regressions with pandas=1.1.5 versus 1.1.3

Version 1.1.3:

Python 3.7.9 | packaged by conda-forge | (default, Dec  9 2020, 20:36:16) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.7.9 | packaged by conda-forge | (default, Dec  9 2020, 20:36:16) [MSC v.1916 64 bit (AMD64)] on win32
In[2]: import time
 ... : import numpy as np
 ... : import pandas as pd
 ... : pd.__version__
Out[2]: '1.1.3'
In[3]: numel = 10000000
 ... : df = pd.DataFrame(dict(a=np.random.rand(numel), b=np.random.randint(0,4000, numel)))
 ... : start = time.time()
 ... : groupby_indices = df.groupby('b').indices
 ... : time.time() - start
Out[3]: 0.46085023880004883

Version 1.1.5:

Python 3.7.9 | packaged by conda-forge | (default, Dec  9 2020, 20:36:16) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.7.9 | packaged by conda-forge | (default, Dec  9 2020, 20:36:16) [MSC v.1916 64 bit (AMD64)] on win32
In[2]: import time
 ... : import numpy as np
 ... : import pandas as pd
 ... : pd.__version__
Out[2]: '1.1.5'
In[3]: numel = 10000000
 ... : df = pd.DataFrame(dict(a=np.random.rand(numel), b=np.random.randint(0,4000, numel)))
 ... : start = time.time()
 ... : groupby_indices = df.groupby('b').indices
 ... : time.time() - start
Out[3]: 57.36550998687744
@jorisvandenbossche
Copy link
Member

@bordingj thanks a lot for the report.
I can confirm that the above code runs slow, and this is already the case on 1.1.4 (so introduced in 1.1.4, and not in 1.1.5, I suppose). Strangely, I don't see a slowdown on master ..

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 15, 2020

I think this is caused by #36911 (which changed the implementation of Grouping.indices, and Categorical._reverse_indexer was much faster as the manual dict comprehension), and the reason we don't see it on master is probably thanks to #36842 (after which we no longer use Grouping.indices to calculate Grouper.indices)

cc @mroeschke @phofl

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Dec 15, 2020
@phofl
Copy link
Member

phofl commented Dec 15, 2020

Grouping.indices could not handle nans previously so we switched to the cython implementation in #36842, which is more consistent anyway I think.

@simonjayhawkins simonjayhawkins added the Closing Candidate May be closeable, needs more eyeballs label Dec 15, 2020
@simonjayhawkins
Copy link
Member

@jorisvandenbossche if this is fixed on master (and 1.2.0rc0) and we are not currently planning another release in the 1.1.x series, can this issue be closed?

@phofl
Copy link
Member

phofl commented Dec 16, 2020

I think we are using the function causing the performance regression in other places. Don't know how the impact is there, but not sure if we can close this.

@simonjayhawkins simonjayhawkins removed the Closing Candidate May be closeable, needs more eyeballs label Dec 19, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Dec 19, 2020
@phofl
Copy link
Member

phofl commented Dec 19, 2020

Have to correct myself here. Looks like we are only using the method in

def test_grouping_is_iterable(self, tsframe):

which seems to be more or less a dummy test. Ran related tests locally.

Not quite sure what to do with this method? Is the code snipped from the test intended to be public available?

@phofl
Copy link
Member

phofl commented Jan 1, 2021

I think this can be closed now that #38649 is merged?

@jreback
Copy link
Contributor

jreback commented Jan 1, 2021

yea i think so
do we have sufficient asvs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants