
ENH: pd.DataFrame.groupby().apply: parallel #59635


Closed
1 of 3 tasks
mhooreman opened this issue Aug 27, 2024 · 2 comments
Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@mhooreman

Feature Type

  • [x] Adding new functionality to pandas
  • [ ] Changing existing functionality in pandas
  • [ ] Removing existing functionality in pandas

Problem Description

I have computationally heavy processing to run on pandas.core.groupby.generic.DataFrameGroupBy or pandas.core.groupby.generic.SeriesGroupBy objects.

I would like to be able to run that processing in parallel, using multiple CPU cores.

Feature Description

Add a new n_jobs: int = 1 parameter to pandas.core.groupby.generic.SeriesGroupBy.apply and pandas.core.groupby.generic.DataFrameGroupBy.apply:

  • Value 1: Keep the current behavior
  • Value > 1 or value == -1: Apply using joblib.Parallel, with the same n_jobs parameter
  • Else: Raise ValueError

Please see the Alternative Solutions section of this issue for the workaround I currently use in this situation, which does not go through apply.
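For illustration only, here is a minimal sketch of how the requested dispatch could look as a standalone helper rather than a pandas method (the helper name groupby_apply_parallel and its signature are assumptions of mine, not an existing or proposed pandas API):

import joblib as jl
import pandas as pd

def groupby_apply_parallel(grouped, func, n_jobs=1):
    """Apply ``func`` to each group, optionally in parallel via joblib."""
    if n_jobs == 1:
        # Keep the current behavior: plain sequential apply.
        return grouped.apply(func)
    if n_jobs > 1 or n_jobs == -1:
        # One joblib task per group; stitch the results back together
        # with the group keys as the index.
        keys, groups = zip(*grouped)
        results = jl.Parallel(n_jobs=n_jobs)(
            jl.delayed(func)(g) for g in groups
        )
        return pd.concat(results, keys=keys)
    raise ValueError(f"{n_jobs=}")

Usage would then look like groupby_apply_parallel(df.groupby('foo'), my_func, n_jobs=-1), which is roughly what df.groupby('foo').apply(my_func, n_jobs=-1) would mean if the parameter were added.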

Alternative Solutions

The code below is an example of the workaround I'm currently using. It's an adaptation of my real code and has not been tested as-is.

import joblib as jl
import pandas as pd

def my_function(*args, n_jobs=1):
    def callback(x):
        return "This is the result"

    if n_jobs == 1:
        return tuple(callback(x) for x in args)
    elif n_jobs > 1 or n_jobs == -1:
        # Return the results computed by the joblib workers.
        return jl.Parallel(n_jobs=n_jobs)(jl.delayed(callback)(x) for x in args)
    else:
        raise ValueError(f"{n_jobs=}")

# Collect the group keys and the per-group values from the grouped DataFrame
# (`data` stands for the original DataFrame being grouped).
gg_id = []
gg_data = []
group_by_cols = ['foo', 'bar', 'baz']
select_col = ['foo']
for g_id, g_data in data.groupby(group_by_cols)[select_col]:
    gg_id.append(g_id)
    gg_data.append(g_data.values)

# Run the (possibly parallel) computation and rebuild an indexed DataFrame.
results = my_function(*gg_data, n_jobs=-1)
results = pd.DataFrame(dict(zip(gg_id, results))).T
results.index.names = group_by_cols

Additional Context

  • Please also see the joblib documentation.
  • Solutions independent of joblib may be possible, but many people are already using joblib without knowing it, so I don't think adding it to the pandas dependencies would be a big deal (a joblib-free sketch using only the standard library follows below).
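If avoiding the joblib dependency matters, here is a rough sketch of the same idea using only the standard library's concurrent.futures (the helper name and structure are my own assumptions, not a pandas API; func must be picklable, i.e. defined at module level):

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def groupby_apply_processes(grouped, func, max_workers=None):
    """Apply ``func`` per group using a standard-library process pool."""
    keys, groups = zip(*grouped)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # map() keeps the results in the same order as the groups.
        results = list(executor.map(func, groups))
    return pd.concat(results, keys=keys)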
@mroeschke
Member

Thanks for the report. This feature request overlaps with #31845 and #43313, so let's keep the discussion in those issues. Closing.

@mhooreman
Author

Thanks @mroeschke. It was not obvious to me that this overlaps with #31845 and #43313; sorry.
