
ENH: pd.DataFrame.groupby().apply: parallel #59635


Closed
1 of 3 tasks
mhooreman opened this issue Aug 27, 2024 · 2 comments
Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@mhooreman

Feature Type

  • [x] Adding new functionality to pandas
  • [ ] Changing existing functionality in pandas
  • [ ] Removing existing functionality in pandas

Problem Description

I have computationally heavy processing to run on pandas.core.groupby.generic.DataFrameGroupBy or pandas.core.groupby.generic.SeriesGroupBy objects.

I would like to be able to run that processing in parallel, using multiple CPU cores.

Feature Description

Add a new n_jobs: int = 1 parameter to pandas.core.groupby.generic.SeriesGroupBy.apply and pandas.core.groupby.generic.DataFrameGroupBy.apply:

  • Value 1: Keep the current behavior
  • Value > 1 or value == -1: Apply using joblib.Parallel, with the same n_jobs parameter
  • Else: Raise ValueError

Please see the Alternative Solutions section of this issue for the workaround I currently use in this situation, which does not go through apply.
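For illustration only, here is a minimal sketch of how the requested dispatch could look as a standalone helper rather than a pandas method (the helper name groupby_apply_parallel and its signature are assumptions of mine, not an existing or proposed pandas API):

import joblib as jl
import pandas as pd

def groupby_apply_parallel(grouped, func, n_jobs=1):
    """Apply ``func`` to each group, optionally in parallel via joblib."""
    if n_jobs == 1:
        # Keep the current behavior: plain sequential apply.
        return grouped.apply(func)
    if n_jobs > 1 or n_jobs == -1:
        # One joblib task per group; stitch the results back together
        # with the group keys as the index.
        keys, groups = zip(*grouped)
        results = jl.Parallel(n_jobs=n_jobs)(
            jl.delayed(func)(g) for g in groups
        )
        return pd.concat(results, keys=keys)
    raise ValueError(f"{n_jobs=}")

Usage would then look like groupby_apply_parallel(df.groupby('foo'), my_func, n_jobs=-1), which is roughly what df.groupby('foo').apply(my_func, n_jobs=-1) would mean if the parameter were added.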

Alternative Solutions

The code below is an example of the workaround I'm currently using. It's an adaptation of my real code and has not been tested as-is.

import joblib as jl
import pandas as pd

def my_function(*args, n_jobs=1):
    def callback(x):
        return "This is the result"

    if n_jobs == 1:
        return tuple(callback(x) for x in args)
    elif n_jobs > 1 or n_jobs == -1:
        # Return the results computed by the joblib workers.
        return jl.Parallel(n_jobs=n_jobs)(jl.delayed(callback)(x) for x in args)
    else:
        raise ValueError(f"{n_jobs=}")

# Collect the group keys and the per-group values from the grouped DataFrame
# (`data` stands for the original DataFrame being grouped).
gg_id = []
gg_data = []
group_by_cols = ['foo', 'bar', 'baz']
select_col = ['foo']
for g_id, g_data in data.groupby(group_by_cols)[select_col]:
    gg_id.append(g_id)
    gg_data.append(g_data.values)

# Run the (possibly parallel) computation and rebuild an indexed DataFrame.
results = my_function(*gg_data, n_jobs=-1)
results = pd.DataFrame(dict(zip(gg_id, results))).T
results.index.names = group_by_cols

Additional Context

  • Please also see the joblib documentation.
  • Solutions independent of joblib may be possible, but many people are already using joblib without knowing it, so I don't think adding it to the pandas dependencies would be a big deal (a joblib-free sketch using only the standard library follows below).
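If avoiding the joblib dependency matters, here is a rough sketch of the same idea using only the standard library's concurrent.futures (the helper name and structure are my own assumptions, not a pandas API; func must be picklable, i.e. defined at module level):

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def groupby_apply_processes(grouped, func, max_workers=None):
    """Apply ``func`` per group using a standard-library process pool."""
    keys, groups = zip(*grouped)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # map() keeps the results in the same order as the groups.
        results = list(executor.map(func, groups))
    return pd.concat(results, keys=keys)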
@mroeschke
Member

Thanks for the report. This feature request overlaps with #31845 and #43313, so let's keep the discussion in those issues. Closing.

@mhooreman
Author

Thanks @mroeschke. It was not obvious to me that this overlaps with #31845 and #43313; sorry.
