Less blocks in groupby #29753
Conversation
pandas/core/groupby/ops.py (Outdated)

@@ -451,11 +451,7 @@ def _cython_operation(
        # categoricals are only 1d, so we
        # are not setup for dim transforming
        if is_categorical_dtype(values) or is_sparse(values):
This was the "fix" for #29746 - not sure this is really the right way to do that, but it makes tests green for now.
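For context, a minimal illustration (not the PR's code) of why that guard exists: categorical and sparse values are backed by 1-D ExtensionArrays, while numpy-dtype blocks can be 2-D, so the dim-transforming path doesn't apply to them.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "b", "a"])
print(cat.ndim)  # 1 -- ExtensionArrays are always one-dimensional

sparse = pd.arrays.SparseArray([0, 0, 1])
print(sparse.ndim)  # 1 -- same restriction for sparse values

block = np.arange(6).reshape(3, 2)
print(block.ndim)  # 2 -- a consolidated numpy block can hold many columns
```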
Yeah, this definitely looks sketchy.
Force-pushed from 4853bdd to 2eed553
We absolutely do not want to remove the blocks. Just make the code simpler / easier.
Yeah, making things simpler / easier is the point of this as well. I just wasn't sure whether performance of frames mostly composed of numpy types that are wider than they are long was a huge sticking point to optimize for. I'll check out some alternates; I still think this method isn't required and adds a lot of unnecessary complexity / exception catching. It might make more sense if we allow …
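To make that tradeoff concrete, here is a hedged timing sketch of the wide-frame case under discussion (sizes mirror the benchmark described later in this thread; this is not the PR's actual benchmark code). Block-wise aggregation makes one cython call over a single 2-D float block, while a column-wise path pays per-column Python overhead, which dominates when each column has only ~100 rows.

```python
import timeit

import numpy as np
import pandas as pd

# wide frame: few rows, many columns, all in a single float64 block
wide = pd.DataFrame(np.random.randn(100, 10_000))
wide["key"] = np.random.randint(0, 10, size=len(wide))

elapsed = timeit.timeit(lambda: wide.groupby("key").sum(), number=5)
print(f"groupby sum on 100 x 10_000 frame: {elapsed / 5:.4f}s per run")
```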
I would like to see, as a first step, moving this code to internals as an iterator.
Cool, makes sense. Thanks for the feedback.
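A hedged sketch of what such a block iterator in internals might look like: the manager exposes block-wise iteration through a small interface so groupby code no longer reaches into its internals. The class and method names here are hypothetical, not pandas API.

```python
from typing import Callable, Iterator, List

import numpy as np


class BlockManagerSketch:
    """Toy stand-in for pandas' internal BlockManager (hypothetical names)."""

    def __init__(self, blocks: List[np.ndarray]):
        self.blocks = blocks  # one 2-D array per dtype, like consolidated blocks

    def iter_block_values(self) -> Iterator[np.ndarray]:
        # groupby would consume this instead of touching manager internals
        yield from self.blocks


def agg_blockwise(mgr: BlockManagerSketch, func: Callable) -> list:
    # each block is reduced independently along the row axis
    return [func(values) for values in mgr.iter_block_values()]


mgr = BlockManagerSketch([np.arange(6.0).reshape(2, 3), np.ones((1, 3))])
print(agg_blockwise(mgr, lambda v: v.sum(axis=1)))
```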
On the cProfile graphs, are there any particular points of interest?
@WillAyd I need to pivot to work on arithmetic for a bit, but if you're interested in taking my part of the groupby baton: the functions libreduction.check_result_array and libreduction._extract_result have analogues in the non-fast methods, but they behave slightly differently. I think it's worthwhile to get those in alignment, or at least document why they are slightly different.
Cool, yeah, will do @jbrockmendel.
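For reference, a rough pure-Python sketch of what those two helpers do (illustrative only; the real implementations live in cython in pandas/_libs/reduction.pyx and differ in detail):

```python
import numpy as np


def check_result_array_sketch(obj, group_len: int) -> None:
    # the fast path requires the user function to reduce each group to a
    # scalar; getting back an array the size of the group means it didn't
    if isinstance(obj, np.ndarray) and len(obj) == group_len:
        raise ValueError("Function does not reduce")


def extract_result_sketch(res):
    # unwrap a length-1 Series/array so callers always see a plain scalar
    if hasattr(res, "_values"):
        res = res._values  # Series -> underlying array
    if isinstance(res, np.ndarray) and res.ndim == 1 and len(res) == 1:
        res = res[0]
    return res
```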
else:  # partial; specific to groupby
    # TODO: func should only return one value; need to remove
    # ohlc from groupby semantics to accomplish generically
    result, _ = f(blk.values)  # better way?
> better way?

blk.apply
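The ohlc wrinkle behind the TODO in that snippet, demonstrated with public API: every other groupby reduction returns one value per group, while ohlc returns four, so f cannot be assumed to return a single aggregated value.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0], index=[0, 0, 1, 1])

print(s.groupby(level=0).sum())   # one value per group
print(s.groupby(level=0).ohlc())  # four values per group: open/high/low/close
```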
Are the ASVs here worth salvaging?
This isn't complete (2 tests fail), but it's arguably controversial, so I'm putting it up for review.
With #29124 we could get rid of the DataFrame aggregation methods that go by blocks. This would hopefully make our code easier to grok and move towards only "one way of doing things" in groupby.
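A hedged sketch of the column-wise "one way" this would move toward, expressed with public API as a stand-in for the internal change (the helper name is made up):

```python
import pandas as pd


def agg_columnwise(df: pd.DataFrame, by: str, how: str) -> pd.DataFrame:
    # aggregate one Series at a time instead of iterating 2-D blocks
    gb = df.groupby(by)
    value_cols = [c for c in df.columns if c != by]
    return pd.DataFrame({col: gb[col].agg(how) for col in value_cols})


df = pd.DataFrame({"by": [0, 0, 1], "a": [1.0, 2.0, 3.0], "b": [4, 5, 6]})
print(agg_columnwise(df, "by", "sum"))
```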
This causes a definite regression for very wide DataFrames with a small number of rows. Looking at the new benchmark, testing frames of N rows x 10_000 columns, where N is one of 100, 1_000, or 10_000, here are the results:
So drastically slower at 100 rows, but we've already reached a break-even point by the time we get to 10,000 rows.
The rest of the benchmarks don't cover wide DataFrames that well, but here are the results for posterity (I think these are noise):
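A sketch of the kind of ASV benchmark described above (class and parameter names are illustrative, not the PR's actual benchmark file):

```python
import numpy as np
import pandas as pd


class GroupByWideFrame:
    # N rows x 10_000 columns, as described above
    params = [100, 1_000, 10_000]
    param_names = ["nrows"]

    def setup(self, nrows):
        arr = np.random.randn(nrows, 10_000)
        self.df = pd.DataFrame(arr)
        self.df["key"] = np.random.randint(0, 100, size=nrows)

    def time_sum(self, nrows):
        self.df.groupby("key").sum()
```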
@jbrockmendel