
groupby().transform(f) is very slow if there are assignment statements to the argument of f() #9945

Closed
ruoyu0088 opened this issue Apr 20, 2015 · 3 comments
Labels: Duplicate Report, Groupby, Performance

Comments

@ruoyu0088

Here is the test code:

import pandas as pd
from numpy.random import randint, rand, seed

seed(10)
N_Group = 20
N_Rows = 1000
df = pd.DataFrame({"g":randint(0, N_Group, N_Rows), "a":rand(N_Rows), "b":rand(N_Rows)})

def f(df):
    df["a"] -= df["b"].mean()
    return df

%time df1 = df.groupby("g").apply(f)
%time df2 = df.groupby("g").transform(f)

the output is:

Wall time: 17 ms
Wall time: 682 ms

Before looping over the groups, apply() disables the 'mode.chained_assignment' option:

    # ignore SettingWithCopy here in case the user mutates
    with option_context('mode.chained_assignment',None):
        return self._python_apply_general(f)

but transform() has no equivalent code.

@shoyer
Member

shoyer commented Apr 20, 2015

This may technically work, but it's really not a good idea to use in-place operations in a groupby operation. For example, see the warning here. So I have mixed feelings about optimizing this.
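
For comparison, here is a minimal sketch of a non-mutating way to get the same centred values of `a` as the function in the report. It assumes only the column names from the test code; `group_mean_b` and `result` are illustrative names, not something from the original report, and this is one possible rewrite rather than the recommended pandas approach.

```python
import numpy as np
import pandas as pd

# Same shape of data as the test code in the report (illustrative values).
rng = np.random.default_rng(10)
n_groups, n_rows = 20, 1000
df = pd.DataFrame({
    "g": rng.integers(0, n_groups, n_rows),
    "a": rng.random(n_rows),
    "b": rng.random(n_rows),
})

# Mean of "b" within each group, broadcast back to the rows of df.
group_mean_b = df.groupby("g")["b"].transform("mean")

# Build a new frame instead of assigning into the group inside f().
result = df.assign(a=df["a"] - group_mean_b)
```

Unlike `transform(f)` in the report, `result` keeps the grouping column `g`, and `df` itself is left unchanged.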

@jreback added the Groupby, Performance, and Difficulty Intermediate labels Apr 20, 2015
@jreback added this to the Next Major Release milestone Apr 20, 2015
@jreback
Contributor

jreback commented Apr 20, 2015

I'll put it on the list. A PR would be helpful here. I agree with @shoyer. IMHO modifying things in a groupby should simply be banned. It is at odds with pure functions and with pandas semantics everywhere except where mutation is explicit.

@rhshadrach
Member

Duplicate of #12653

@rhshadrach marked this as a duplicate of #12653 Apr 16, 2021
@rhshadrach added the Duplicate Report label Apr 16, 2021