ENH: optimized Groupby.diff() #33658

Closed
dequadras opened this issue Apr 19, 2020 · 6 comments
Labels
Enhancement, Groupby, Performance

Comments

@dequadras
Contributor

Is your feature request related to a problem?

Calling groupby().diff() on a big dataset with many groups is quite slow. The image below shows how, in certain cases, optimizing it with numba can yield a ~1000x speedup.

[benchmark figure: runtimes of groupby().diff() vs. the numba group_diff, showing up to ~1000x speedup]

Describe the solution you'd like

Now, my question is: can this be optimized within pandas itself?
I realise the case is somewhat special, but I often have to work with many small groups and I'm running into speed issues.


Additional context

Here's the Python code in text format:

import numpy as np
import pandas as pd
from numba import njit

# create a dataframe with many small groups
GROUPS = 100000
SIZE = 1000000
df = pd.DataFrame()
df["groups"] = np.random.choice(np.arange(GROUPS), size=SIZE)
df["values"] = np.random.random(size=SIZE)
df.sort_values("groups", inplace=True)

diff_pandas = df.groupby("groups")["values"].diff().values

@njit
def group_diff(groups: np.ndarray, values: np.ndarray, lag: int) -> np.ndarray:
    result = np.empty_like(values, dtype=np.float64)
    for i in range(values.shape[0]):
        # the first `lag` rows have no in-group predecessor; checking i < lag
        # also avoids the negative-index wraparound of groups[i - lag]
        if i < lag or groups[i] != groups[i - lag]:
            result[i] = np.nan
        else:
            result[i] = values[i] - values[i - lag]
    return result

# integer group codes; groups are contiguous because df is sorted by "groups"
groups = df.groupby("groups").ngroup().values
values = df["values"].values
diff_numba = group_diff(groups, values, 1)

# check that both approaches agree
np.isclose(diff_pandas, diff_numba, equal_nan=True).all()
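
For reference, a minimal timing sketch (not part of the original report; absolute numbers vary by machine). The numba function is called once beforehand so that JIT compilation is excluded from the measurement:

import time

group_diff(groups, values, 1)  # warm-up call: triggers JIT compilation

start = time.perf_counter()
df.groupby("groups")["values"].diff()
pandas_time = time.perf_counter() - start

start = time.perf_counter()
group_diff(groups, values, 1)
numba_time = time.perf_counter() - start

print(f"pandas: {pandas_time:.4f}s, numba: {numba_time:.4f}s")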
dequadras added the Enhancement and Needs Triage labels on Apr 19, 2020
@dsaxton
Member

dsaxton commented Apr 19, 2020

Seems related to some work @mroeschke has done

dsaxton added the Groupby and Performance labels and removed the Needs Triage label on Apr 19, 2020
@jbrockmendel
Member

PR would be welcome

@lukemanley
Member

Does #45575 address this? It was merged after this issue was opened. It doesn't use numba, but it did get 1000x for a handful of cases.

@jbrockmendel
Member

Any idea how thorough the "handful of cases" was? Or is there non-trivial room for further improvement by implementing something in groupby.pyx?

@lukemanley
Member

Any idea how thorough the "handful of cases" was? Or is there non-trivial room for further improvement by implementing something in groupby.pyx?

#45575 shows the ASV benchmarks, which cover a lot of different cases. Not all are 1000x, but most cases see a significant improvement.
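
As context for how a non-numba implementation can reach that kind of speedup: over group-sorted data, the diff can be vectorized with a shifted comparison of the group codes. A rough pure-numpy sketch of that style of approach (illustrative only, not the actual #45575 implementation; vectorized_group_diff is a hypothetical name):

import numpy as np

def vectorized_group_diff(codes: np.ndarray, vals: np.ndarray, lag: int = 1) -> np.ndarray:
    # codes: integer group labels, assumed sorted so each group is contiguous
    out = np.full(vals.shape, np.nan, dtype=np.float64)
    out[lag:] = vals[lag:] - vals[:-lag]
    # mask positions whose lagged neighbor belongs to a different group
    out[lag:][codes[lag:] != codes[:-lag]] = np.nan
    return out

# should agree with the numba version on the data from the example above:
# np.isclose(vectorized_group_diff(groups, values), diff_numba, equal_nan=True).all()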

@jbrockmendel
Member

OK, I'm happy to consider this resolved. Good job!
