WIP: ENH: Add numba engine to groupby apply #35445

Closed
wants to merge 8 commits into from

mroeschke
Member

@mroeschke mroeschke commented Jul 29, 2020

Some notes:

  • The passed function must be a reduction op
  • The numba engine will drop the grouping column by default
  • The numba engine can only operate on numeric data and will return float64
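To make the three notes above concrete, here is a rough NumPy-only sketch of the behaviour they describe. It is not the code in this PR: `grouped_reduce` is a hypothetical helper, and the sort-and-slice grouping strategy is an assumption chosen for illustration. It takes the grouping keys separately from the value columns (so the grouping column never appears in the output), casts the values to float64, and requires `func` to reduce a 1-D array to a scalar.

```python
import numpy as np


def grouped_reduce(keys, values, func):
    """Apply a reduction `func` to every column of every group.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    keys = np.asarray(keys)
    # Numeric data only: cast up front so the result is always float64.
    values = np.asarray(values, dtype=np.float64)
    # Stable sort so that rows belonging to the same group are contiguous.
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    sorted_values = values[order]
    # Unique keys plus the start offset of each group in the sorted data.
    unique_keys, starts = np.unique(sorted_keys, return_index=True)
    ends = np.append(starts[1:], len(sorted_keys))
    out = np.empty((len(unique_keys), values.shape[1]), dtype=np.float64)
    for i in range(len(unique_keys)):
        for j in range(values.shape[1]):
            # `func` must be a reduction: 1-D array in, scalar out.
            out[i, j] = func(sorted_values[starts[i]:ends[i], j])
    return unique_keys, out


def f(x):
    return np.sum(x) + 1


keys = np.array([0, 1, 1, 2])                          # grouping column
values = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])    # remaining columns
group_labels, result = grouped_reduce(keys, values, f)
print(group_labels)
print(result)
```

The explicit loops over groups and columns are the kind of code a JIT compiler such as numba handles well, which is one plausible reason for restricting the engine to numeric reductions.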

Preliminary default timing:

(pandas-dev) matthewroeschke:pandas-mroeschke matthewroeschke$ ipython
Python 3.8.3 | packaged by conda-forge | (default, Jun  1 2020, 17:21:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: df_g = pd.DataFrame({'a': range(10**4), 'b': range(10**4), 'c': range(10**4)})

In [2]: def f(x):
   ...:     return np.sum(x) + 1
   ...:

In [3]: df_g.groupby('a').apply(f)
Out[3]:
          a      b      c
a
0         1      1      1
1         2      2      2
2         3      3      3
3         4      4      4
4         5      5      5
...     ...    ...    ...
9995   9996   9996   9996
9996   9997   9997   9997
9997   9998   9998   9998
9998   9999   9999   9999
9999  10000  10000  10000

[10000 rows x 3 columns]

In [4]: %timeit df_g.groupby('a').apply(f)
3.07 s ± 57.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: df_g.groupby('a').apply(f, engine='numba', engine_kwargs={'parallel': True})
Out[5]:
            0        1
0         1.0      1.0
1         2.0      2.0
2         3.0      3.0
3         4.0      4.0
4         5.0      5.0
...       ...      ...
9995   9996.0   9996.0
9996   9997.0   9997.0
9997   9998.0   9998.0
9998   9999.0   9999.0
9999  10000.0  10000.0

[10000 rows x 2 columns]

In [6]: %timeit df_g.groupby('a').apply(f, engine='numba', engine_kwargs={'parallel': True})
510 ms ± 3.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
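For reference, the ratio between the two `%timeit` means above works out as follows. This is plain arithmetic on the reported numbers, with variable names chosen here for illustration; it is not a new measurement.

```python
# Mean timings copied from In [4] and In [6] above.
default_seconds = 3.07    # default engine
numba_seconds = 0.510     # engine='numba', parallel=True

speedup = default_seconds / numba_seconds
print(f"{speedup:.1f}x")  # prints "6.0x"
```

This covers only the single 10,000-group frame shown, and the two outputs differ in shape and dtype, so the figure is a rough indication rather than a like-for-like benchmark.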

@pep8speaks

pep8speaks commented Jul 29, 2020

Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-05 18:13:49 UTC

@mroeschke mroeschke closed this Aug 16, 2020
@mroeschke mroeschke deleted the numba_groupby_apply branch November 12, 2020 05:15
Development

Successfully merging this pull request may close these issues.

ENH: Add engine keyword argument to groupby.apply to leverage Numba
2 participants