I have computationally heavy processing to run on pandas.core.groupby.generic.DataFrameGroupBy or pandas.core.groupby.generic.SeriesGroupBy objects.
I would like to be able to run it in parallel, using multiple CPUs.
Feature Description
Add new n_jobs: int=1 parameter to pandas.core.groupby.generic.SeriesGroupBy.apply and pandas.core.groupby.generic.DataFrameGroupBy.apply
Value 1: Keep the current behavior
Value > 1 or value == -1: Apply using joblib.Parallel, with the same n_jobs parameter
Else: Raise ValueError
Please see the alternative solutions section of this issue for implementation in my current situation, without using apply
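To make the proposed n_jobs semantics concrete, here is a minimal sketch written as a standalone helper rather than as a change to pandas itself. The name parallel_groupby_apply is hypothetical and is not part of pandas or joblib. It assumes a SeriesGroupBy and a function that returns one scalar per group, and unlike the regular apply it does not carry over the index or Series names.

```python
import joblib
import pandas as pd


def parallel_groupby_apply(grouped, func, n_jobs=1):
    """Apply func to every group of a SeriesGroupBy (hypothetical helper).

    Assumes func returns a single scalar per group.
    """
    if n_jobs == 1:
        # Value 1: keep the current behavior.
        return grouped.apply(func)
    if n_jobs > 1 or n_jobs == -1:
        # Materialize the (key, group) pairs so they can be dispatched to workers.
        items = list(grouped)
        keys = [key for key, _ in items]
        results = joblib.Parallel(n_jobs=n_jobs)(
            joblib.delayed(func)(group) for _, group in items
        )
        # One result per group, indexed by the group key.
        return pd.Series(results, index=pd.Index(keys))
    # Any other value is rejected, as described above.
    raise ValueError(f"{n_jobs=}")
```

With this sketch, something like parallel_groupby_apply(data.groupby("foo")["bar"], my_callback, n_jobs=-1) would spread the groups over all available CPUs. Whether this pays off depends on how expensive the callback is compared with the cost of sending each group to a worker process.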
Alternative Solutions
The code below is an example of the workaround I'm currently using. It's an adaptation of my real code, and has not been tested...
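import joblib as jl
import pandas as pd


def my_function(*args, n_jobs=1):
    def callback(x):
        # Placeholder for the real, expensive per-group computation.
        return "This is the result"

    if n_jobs == 1:
        return tuple(callback(x) for x in args)
    elif n_jobs > 1 or n_jobs == -1:
        # Run one task per group and return the results in input order.
        return tuple(jl.Parallel(n_jobs=n_jobs)(jl.delayed(callback)(x) for x in args))
    else:
        raise ValueError(f"{n_jobs=}")


# `data` is the DataFrame being processed (defined elsewhere).
gg_id = []
gg_data = []
group_by_cols = ['foo', 'bar', 'baz']
select_col = ['foo']
for g_id, g_data in data.groupby(group_by_cols)[select_col]:
    gg_id.append(g_id)
    gg_data.append(g_data.values)

results = my_function(*gg_data, n_jobs=-1)
# Assumes the real callback returns a sequence of values per group
# (the placeholder string above stands in for that).
results = pd.DataFrame(dict(zip(gg_id, results))).T
results.index.names = group_by_cols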
Additional Context
Please also see the documentation of joblib
Solutions that do not depend on joblib may be possible, but I think many people are already using joblib without knowing it, so I don't think adding it to pandas' dependencies would be a big deal.
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas