ENH: Add numba engine to groupby.transform #32854
Conversation
Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻
Comment last updated at 2020-04-16 04:34:45 UTC
Seems reasonable
pandas/core/util/numba_.py
Outdated
@@ -56,3 +58,44 @@ def impl(data, *_args):
        return impl

    return numba_func


def split_for_numba(arg: FrameOrSeries):
Can you annotate return types here?
pandas/core/util/numba_.py
Outdated
    return arg.to_numpy(), arg.index.to_numpy(), columns_as_array


def validate_udf(func: Callable, include_columns: bool = False):
Same comment
@mroeschke this is orthogonal to the .agg one? IOW does ordering of merge matter?
looks pretty good.
I would introduce a Dispatcher concept here, with a Cython and a Numba Dispatcher.
this way we can move all of the messy logic to that class and just call generically.
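The Dispatcher concept suggested above could look roughly like the following. This is a hypothetical sketch of the idea only; every class and function name here is illustrative and not part of the actual pandas codebase:

```python
# Hypothetical sketch of the suggested Dispatcher concept; names are
# illustrative, not the real pandas implementation.
from abc import ABC, abstractmethod


class TransformDispatcher(ABC):
    """Dispatch a groupby.transform UDF to a specific execution engine."""

    @abstractmethod
    def evaluate(self, func, group, *args, **kwargs):
        ...


class CythonDispatcher(TransformDispatcher):
    def evaluate(self, func, group, *args, **kwargs):
        # Default path: call the UDF on the pandas object directly.
        return func(group, *args, **kwargs)


class NumbaDispatcher(TransformDispatcher):
    def evaluate(self, func, group, *args, **kwargs):
        # Numba path: hand the UDF numpy arrays instead of pandas objects.
        values, index = group.to_numpy(), group.index.to_numpy()
        return func(values, index, *args, **kwargs)


def get_dispatcher(engine: str) -> TransformDispatcher:
    # Generic call site: the messy per-engine logic lives in the classes.
    return {"cython": CythonDispatcher(), "numba": NumbaDispatcher()}[engine]
```

With something like this, the transform loop could call `dispatcher.evaluate(func, group)` without branching on the engine at each step.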
        klass = type(self._selected_obj)

        results = []
        for name, group in self:
            object.__setattr__(group, "name", name)
            res = func(group, *args, **kwargs)
            if engine == "numba":
like to see this as
def _evaluate_udf
from pandas.compat._optional import import_optional_dependency


def check_kwargs_and_nopython(
    kwargs: Optional[Dict] = None, nopython: Optional[bool] = None
) -> None:
can you add a doc-string
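A docstring for this helper might look like the sketch below. The validation logic is inferred from the signature in the diff above and may differ from the merged pandas implementation:

```python
# Sketch of the validation helper discussed above; behavior is inferred
# from the review context, not copied from the final pandas source.
from typing import Dict, Optional


def check_kwargs_and_nopython(
    kwargs: Optional[Dict] = None, nopython: Optional[bool] = None
) -> None:
    """
    Validate that 'nopython' is not passed both inside engine_kwargs
    and as an explicit argument.

    Parameters
    ----------
    kwargs : dict, optional
        Keyword arguments destined for the UDF / numba.jit.
    nopython : bool, optional
        The nopython flag extracted from engine_kwargs.

    Raises
    ------
    ValueError
        If 'nopython' is supplied in both places.
    """
    if kwargs and "nopython" in kwargs and nopython is not None:
        raise ValueError(
            "'nopython' was passed both in kwargs and as an explicit "
            "argument; supply it only once"
        )
```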
def split_for_numba(arg: FrameOrSeries) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Split pandas object into its components as numpy arrays for numba functions.
can you add Parameters / Returns section
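Filling in the requested sections, the helper could read as follows. This sketch follows the three-array return shown in the diff above (values, index, columns); the merged version may differ, since the column-passing behavior was later reconsidered in this PR:

```python
# Sketch of split_for_numba based on the diff shown in this review; the
# three-array return follows the annotated signature above.
from typing import Tuple

import numpy as np
import pandas as pd


def split_for_numba(arg) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Split pandas object into its components as numpy arrays for numba functions.

    Parameters
    ----------
    arg : Series or DataFrame
        The pandas object to decompose.

    Returns
    -------
    tuple of (values, index, columns) as numpy arrays.
        columns is an empty array when arg is a Series.
    """
    if isinstance(arg, pd.DataFrame):
        columns_as_array = arg.columns.to_numpy()
    else:
        columns_as_array = np.array([])
    return arg.to_numpy(), arg.index.to_numpy(), columns_as_array
```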
def validate_udf(func: Callable, include_columns: bool = False) -> None:
    """
same
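A plausible body for `validate_udf`, checking the `def f(values, index, ...)` naming constraint stated in the PR description. This is a sketch inferred from that constraint, not the merged pandas code:

```python
# Sketch of validate_udf; the expected parameter names follow the
# "def f(values, index, ...)" constraint stated in the PR description.
import inspect
from typing import Callable


def validate_udf(func: Callable, include_columns: bool = False) -> None:
    """
    Validate that the user defined function's leading parameters are
    named (values, index) and, optionally, columns.
    """
    udf_params = list(inspect.signature(func).parameters)
    expected = ["values", "index"]
    if include_columns:
        expected.append("columns")
    if udf_params[: len(expected)] != expected:
        raise ValueError(
            f"The first arguments to the user defined function "
            f"must be named {expected}"
        )
```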
for timings you can use this code
Here are the timings with the above benchmark. I'll use a modified version of it for the ASV.
It's also not immediately obvious how to include a Dispatcher concept here, especially since the "cython" path has a lot of fallback behavior. I can look into introducing a dispatcher concept in a followup.
Also @jreback, is it really useful that, when doing a DataFrame transform, the column names are passed to the udf? I can see it as a potential pitfall, since column names are usually strings and numba functions do not support objects. EDIT: Made a decision to not pass the dataframe columns as an array into the UDF before calling transform.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
In the same spirit of #31845, adding `engine` and `engine_kwargs` arguments to `groupby.transform` (which was easier to tackle first than `groupby.apply`). This signature is the same as what was added to `rolling.apply`.

Constraints:
- The UDF must be of the form `def f(values, index, ...)`, with exactly those parameter names, as we will pass the values and the pandas index (as a numpy array) into the udf.
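The signature constraint above can be illustrated without jitting; this sketch calls the UDF directly on the split arrays to show the contract's shape (the numba engine would jit-compile `f` before calling it):

```python
import pandas as pd


def f(values, index):
    # The engine passes the group's values and its index, both as numpy
    # arrays; the parameter names must be exactly "values" and "index".
    return values - values.mean()


# Calling the UDF by hand on one group's arrays, mimicking what the
# numba engine would do after splitting the pandas object.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 3.0, 5.0]})
grp = df.loc[df["key"] == "a", "val"]
out = f(grp.to_numpy(), grp.index.to_numpy())
```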