
ENH: Add separate numba kernels for groupby aggregations #53731


Merged: 19 commits into pandas-dev:main from lithomas1:groupby-numba-kernels on Aug 10, 2023

Conversation

@lithomas1 (Member) commented Jun 19, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This PR adds separate groupby numba kernels, since the old shared kernels with Window required an expensive sort beforehand.

E.g. look at this snakeviz profile (pay attention to the take call; for reference, column_looper is the part doing the actual aggregation):

[snakeviz profile screenshot]
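To illustrate the difference: the old shared-with-Window path had to sort/take the values by group before aggregating, while a grouped kernel can scatter each row straight into its group's accumulator using the group labels. A minimal NumPy sketch of the grouped-sum idea (not the actual pandas kernel, which is numba-jitted and also handles min_periods):

```python
import numpy as np

def grouped_sum(values, labels, ngroups):
    # One pass over the data: accumulate into the slot given by each
    # row's group label -- no sort/take of the input required.
    out = np.zeros(ngroups, dtype=np.float64)
    counts = np.zeros(ngroups, dtype=np.int64)
    for i in range(len(values)):
        lab = labels[i]
        if lab < 0:  # -1 marks rows excluded from any group
            continue
        v = values[i]
        if not np.isnan(v):
            out[lab] += v
            counts[lab] += 1
    # Groups with no observations yield NaN rather than 0
    out[counts == 0] = np.nan
    return out

vals = np.array([1.0, 2.0, np.nan, 4.0])
labs = np.array([0, 1, 0, 1])
grouped_sum(vals, labs, 2)  # array([1., 6.])
```

In the real kernels this loop body is what numba jit-compiles, so the per-row Python overhead disappears.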

Benchmarks

       before           after         ratio
     [43d12a23]       [6cf83167]
                      <groupby-numba-kernels>
-         197±5ms         66.6±1ms     0.34  groupby.GroupByNumbaAgg.time_frame_agg('float64', 'sum')
-         234±4ms       77.5±0.7ms     0.33  groupby.GroupByNumbaAgg.time_frame_agg('float64', 'var')
-         221±2ms       65.2±0.7ms     0.30  groupby.GroupByNumbaAgg.time_frame_agg('float64', 'mean')
-       20.1±0.1s         54.3±1ms     0.00  groupby.GroupByNumbaAgg.time_frame_agg('float64', 'max')
-       20.3±0.2s       54.3±0.8ms     0.00  groupby.GroupByNumbaAgg.time_frame_agg('float64', 'min')

Compared to Cython, we are still ~2x slower, but I think I can get back some more perf (would be a follow-up PR).

If parallel=True is passed, though, we can be around ~2x faster than our Cython impl.

(Benchmarks were run on a 2019 Intel MacBook Pro, Core i7-9750H w/ 16 GB RAM.)
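For reference, the parallel path is enabled through engine_kwargs on the groupby reduction. A usage sketch with made-up data (the engine="numba" path requires numba to be installed, so this falls back to the default Cython engine without it):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 1M rows spread over 100 groups
df = pd.DataFrame(
    {
        "key": np.random.randint(0, 100, 1_000_000),
        "val": np.random.randn(1_000_000),
    }
)

try:
    import numba  # noqa: F401

    # numba engine; parallel=True enables numba's threaded execution
    result = df.groupby("key")["val"].sum(
        engine="numba",
        engine_kwargs={"parallel": True},
    )
except ImportError:
    # numba not installed: use the default Cython engine
    result = df.groupby("key")["val"].sum()
```

Note the first numba call pays a one-off JIT compilation cost; the speedups quoted above are for subsequent calls.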

cc @jbrockmendel if interested.
(You and I talked about parallelism in pandas a couple weeks back).

@lithomas1 lithomas1 added Groupby numba numba-accelerated operations labels Jun 19, 2023
@lithomas1 lithomas1 requested a review from rhshadrach as a code owner June 19, 2023 17:51
@@ -1484,6 +1484,7 @@ def _numba_agg_general(
func: Callable,
dtype_mapping: dict[np.dtype, Any],
engine_kwargs: dict[str, bool] | None,
is_grouped_kernel: bool = False,
lithomas1 (Member Author):

Technically can remove this now, since all the existing numba kernels have been ported to have a groupby variant.

If you'd rather have me split this PR by kernel, this would be necessary (would remove after all the PRs land).

@lithomas1 lithomas1 force-pushed the groupby-numba-kernels branch from 6cf8316 to f29ea3e Compare June 19, 2023 17:55
@mroeschke (Member):

While there is a nice speedup here, I am a little concerned about proliferating implementations of reductions in the spirit of #53261

@lithomas1 (Member Author):

While there is a nice speedup here, I am a little concerned about proliferating implementations of reductions in the spirit of #53261

The problem is the numba kernels are basically useless without this.

I did some benchmarking a while back, and numba was ~5x slower than Cython before this change.
Because the majority of time is spent in the sorting overhead, this means that even running the numba aggregations in parallel doesn't really provide a speedup.

With these kernels at least, I think we can share the Kahan summation part for sum/mean/var.
I could probably also reimplement mean in terms of sum, so the impact of this shouldn't be too horrible.
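For instance, a compensated-summation helper that sum and mean could share might look like this (a minimal pure-Python sketch of the idea; pandas' actual kernels are numba-jitted and NaN-aware):

```python
def kahan_sum(values):
    # Compensated (Kahan) summation: carry a running correction term
    # so rounding error does not accumulate over many additions.
    total = 0.0
    comp = 0.0
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total


def kahan_mean(values):
    # mean reimplemented in terms of the shared compensated sum
    return kahan_sum(values) / len(values)
```

With many additions of a value like 0.1, the compensated sum stays close to exact while a naive running sum drifts, which is why sharing this piece across sum/mean/var is attractive.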

@lithomas1 (Member Author):

Gentle ping
@jbrockmendel @mroeschke @rhshadrach

lithomas1 referenced this pull request Jul 23, 2023
* ENH: non float64 result support in numba groupby

* refactor & simplify

* fix CI

* maybe green?

* skip unsupported ops in other bench as well

* updates from code review

* remove commented code

* update whatsnew

* debug benchmarks

* Skip min/max benchmarks
@@ -18,6 +18,9 @@
"pandas/_testing/__init__.py",
"pandas/_testing/_hypothesis.py",
"pandas/_testing/_io.py",
# Numba does not like typing.cast
Reviewer (Member):

is there any prospect of this being fixed in numba? i.e. condition under which we can remove this?

Reviewer (Member):

can you add something like "# TODO(numba#5474): can be removed once numba#5474 is addressed"

lithomas1 (Member Author):

Looks like we lost the ability to add comments in the pyright json config file.

@jbrockmendel (Member):

If parallel=True is passed, though, we can be around ~2x faster than our Cython impl.

Is this one of the use cases where there is a prospect of parallelizing the cython version too?

@lithomas1 (Member Author):

Is this one of the use cases where there is a prospect of parallelizing the cython version too?

In the near future, no, since shipping openmp in wheels is a PITA.
I think we need an openmp runtime package/wheel or something to depend on on PyPI before moving forwards with parallelizing Cython.

numba seems to have at least partially solved this problem: I think you can use the tbb backend when installing with pip, and on conda-forge you can choose between tbb and openmp.
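A sketch of how the threading layer gets selected in numba (the tbb layer only actually loads if the tbb package is installed, e.g. via pip install tbb):

```python
try:
    import numba

    # Select the threading layer before any parallel=True function is
    # compiled; besides "tbb", numba also supports "omp" (OpenMP) and
    # the pure-Python-packaging-friendly default "workqueue".
    numba.config.THREADING_LAYER = "tbb"
except ImportError:
    numba = None  # numba not installed; nothing to configure
```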

@jbrockmendel (Member):

No objection in principle; will defer to Matt since he's more involved in maintaining this code.

Comment on lines 182 to 183
if nobs == 0 == min_periods:
result = 0.0
Reviewer (Member):

Is this hit? Why default to 0.0 instead of nan?

lithomas1 (Member Author):

I don't think so, my bad. I copied it from the sliding window code.

This is updated now.

@lithomas1 (Member Author):

@mroeschke
Do you have time to take a look at this soon?

(I'd like to get this in, if not the apply stuff, at least for 2.1)

@mroeschke mroeschke added this to the 2.1 milestone Aug 10, 2023
@mroeschke (Member) left a comment:

LGTM, but still hoping one day we can have one aggregation kernel that can apply to both window and groupby (and standard) functions 🤞

@mroeschke mroeschke merged commit b0e1130 into pandas-dev:main Aug 10, 2023
@mroeschke (Member):

Thanks @lithomas1

@lithomas1 lithomas1 deleted the groupby-numba-kernels branch August 10, 2023 20:59
mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Aug 18, 2023
…53731)

* ENH: Add separate numba kernels for groupby aggregations

* add whatsnew

* fixes from pre-commit

* fix window tests

* fix tests?

* cleanup

* is_grouped_kernel=True in groupby

* typing

* fix typing?

* fix now

* Just ignore

* remove unnecessary code

* remove comment
Labels
Groupby numba numba-accelerated operations
4 participants