ENH: parallel support in .apply #13111
Comments
cc @mcg1969
I too worry about premature parallelization, but I'm not sure how you prevent that. I think I'd rather try and find ways to encourage Numba compilation (and if there are missing Numba features preventing that from being effective, address them). At least then you could engage
As a reminder, here is the recipe for a simple convert-apply-convert with minimal overhead for dask > 0.8.2
For those also wondering: you must also pass
What kind of effort would be required exactly to implement this feature? I think if someone gives a fairly detailed answer to that question, it should help all concerned people in the community.
@enlighter does the dask-based solution work for you? Properly supporting this in pandas would be a decent amount of effort. Dask already takes care of much of that.
Closing this for now. If a specific solution is wanted by the community, one could open an issue for that.
xref #5751
questions from SO.
mrocklin's nice example of using .apply
So here is an example of how to do a parallel apply using dask. This could be baked into `.apply()` in pandas by the following signature enhancement:

current:

proposed:

where `engine='dask'` (or `numba` at some point) are possibilities. `chunksize` would map directly to `npartitions` and default to the number of cores if not specified. Further, this would allow `engine` to be a meta object like `Dask(scheduler='multiprocessing')` to support other options one would commonly pass (could also move `chunksize` inside that instead of as a separate object).

impl and timings:
Now for some caveats.
People want to parallelize a poor implementation. Generally you proceed thru the following steps first:

- `cython` or `numba` on the user-defined function
- `.apply` is always the last choice

You always want to make code simpler, not more complex. It's hard to know a priori where the bottlenecks are. People think `.apply` is some magical thing; it's NOT, it's JUST A FOR LOOP. The problem is people tend to throw in the kitchen sink, and just everything, which is just a terrible idea.

Ok, my 2c about optimizing things.
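To make the "just a for loop" point concrete, here is a small comparison (my own illustration, not from the thread) of `.apply` against the equivalent explicit loop and the vectorized expression you should reach for first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000), "b": np.arange(1_000)})

# .apply(axis=1) is essentially a Python-level for loop over rows
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# the same loop, spelled out explicitly
loop = pd.Series([row["a"] + row["b"] for _, row in df.iterrows()])

# vectorized version: one NumPy operation, no per-row Python overhead
fast = df["a"] + df["b"]
```

All three produce the same result; only the vectorized version avoids per-row Python interpreter overhead, which is why it belongs earlier in the optimization steps above.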
In order for parallelization to actually matter, the function you are computing should take some non-trivial amount of time compared to things like:
If these criteria are met, then sure give it a try.
I think providing a first-class way in pandas to parallelize things, even though people will just naively use it, is probably not a bad thing.
Further extensions to this are: `to_dask()` (return a `dask.dataframe` to the user directly), and the `engine='dask'` syntax for `.groupby()`.