Skip to content

Less blocks in groupby #29753

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from
Closed

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Nov 20, 2019

This isn't complete (2 tests fail) but this is arguably controversial, so putting up for review.

With #29124 we could get rid of the DataFrame aggregation methods that go by blocks. This would hopefully make our code easier to grok and move towards only "one way of doing things" in groupby.

This causes a definite regression for very wide DataFrames with a small number of rows. If you look at the new benchmark created, when testing Frames of N rows x 10_000 columns where N is one of 100, 1000 or 10_000 here are the results:

       before           after         ratio
     [6b3ba986]       [124a135a]
     <categorical-regression>       <easier-groupby>
+      3.94±0.2ms        2.55±0.1s   646.23  groupby.GroupByWide.time_wide(100)
+        38.3±6ms       2.49±0.02s    64.91  groupby.GroupByWide.time_wide(1000)

So drastically slower at 100 rows, but we've already reached a break-even point by the time we get to 10,000 rows.

The rest of the benchmarks don't cover wide data frames that well but here are the results for posterity (I think these are noise)

       before           after         ratio
     [6b3ba986]       [124a135a]
     <categorical-regression>       <easier-groupby>
+        862±30μs       1.15±0.1ms     1.34  groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'direct')
+         412±7μs         533±50μs     1.29  groupby.GroupByMethods.time_dtype_as_group('float', 'tail', 'direct')
+        300±10μs         366±10μs     1.22  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'direct')
+        375±10ms          455±6ms     1.21  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'transformation')
+         287±6μs         335±20μs     1.17  groupby.GroupByMethods.time_dtype_as_group('object', 'shift', 'direct')
+         252±2ms         290±10ms     1.15  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'direct')
+         414±3μs         469±10μs     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'tail', 'transformation')
+         398±9ms          450±8ms     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct')

@jbrockmendel

@pep8speaks
Copy link

pep8speaks commented Nov 20, 2019

Hello @WillAyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-21 04:54:58 UTC

@@ -451,11 +451,7 @@ def _cython_operation(

# categoricals are only 1d, so we
# are not setup for dim transforming
if is_categorical_dtype(values) or is_sparse(values):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the "fix" for #29746 - not sure this is really the right way to do that but makes tests green for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah this definitely looks sketchy

@WillAyd WillAyd force-pushed the less-blocks-groupby branch from 4853bdd to 2eed553 Compare November 20, 2019 19:12
@WillAyd
Copy link
Member Author

WillAyd commented Nov 20, 2019

Here are graphs for the cProfile output using both methods and the below code:

>>> import pandas as pd
>>> import numpy as np
>>> import cProfile
>>> grp = pd.DataFrame(np.ones((1, 100_00))).groupby([1])
>>> cProfile.run('grp.sum()', 'old_grpbystats')

old_output
new_output

@jreback
Copy link
Contributor

jreback commented Nov 20, 2019

We absolutely do not want to remove the blocks. Just make the code simpler / easier.

@WillAyd
Copy link
Member Author

WillAyd commented Nov 20, 2019

Yea making things simpler / easier is the point of this as well. Just wasn't sure if performance of frames mostly composed of numpy types that are wider than they are larger was a huge sticking point to optimize on

I'll check out some alternates; I still think this method isn't required and adds a lot of unnecessary complexity / exception catching. It might make more sense if we allow _iterate_slices in the DataFrameGroupBy class to return an Iterable of Blocks / Series objects instead of just the latter

@jreback
Copy link
Contributor

jreback commented Nov 20, 2019

i would lie to see as a first step moving this code to internals as an iterator
would make things a lot simpler

@WillAyd
Copy link
Member Author

WillAyd commented Nov 20, 2019

Cool makes sense. Thanks for the feedback

@jbrockmendel
Copy link
Member

On the cProfile graphs are there any particular points of interest?

@jbrockmendel
Copy link
Member

@WillAyd i need to pivot to work on arithmetic for a bit, but if you're interested in taking taking my part of the groupby baton, the functions libreduction.check_result_array and libreduction._extract_result have analogues in the non-fast methods, but they behave slightly differently. I think its worthwhile to get those in alignment or at least document why they are slightly different.

@WillAyd
Copy link
Member Author

WillAyd commented Nov 21, 2019

Cool yea will do @jbrockmendel

@WillAyd WillAyd closed this Nov 23, 2019
else: # partial; specific to groupby
# TODO: func should only return one value; need to remove
# ohlc from groupby semantics to accomplish generically
result, _ = f(blk.values) # better way?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better way?

blk.apply

@jbrockmendel
Copy link
Member

are the asvs here worth salvaging?

@WillAyd WillAyd deleted the less-blocks-groupby branch January 16, 2020 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants