Less blocks in groupby #29753

WillAyd · 2019-11-20T19:08:23Z

This isn't complete (2 tests fail) but this is arguably controversial, so putting up for review.

With #29124 we could get rid of the DataFrame aggregation methods that go by blocks. This would hopefully make our code easier to grok and move towards only "one way of doing things" in groupby.

This causes a definite regression for very wide DataFrames with a small number of rows. If you look at the new benchmark created, when testing Frames of N rows x 10_000 columns where N is one of 100, 1000 or 10_000 here are the results:

       before           after         ratio
     [6b3ba986]       [124a135a]
     <categorical-regression>       <easier-groupby>
+      3.94±0.2ms        2.55±0.1s   646.23  groupby.GroupByWide.time_wide(100)
+        38.3±6ms       2.49±0.02s    64.91  groupby.GroupByWide.time_wide(1000)

So drastically slower at 100 rows, but we've already reached a break-even point by the time we get to 10,000 rows.

The rest of the benchmarks don't cover wide data frames that well but here are the results for posterity (I think these are noise)

       before           after         ratio
     [6b3ba986]       [124a135a]
     <categorical-regression>       <easier-groupby>
+        862±30μs       1.15±0.1ms     1.34  groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'direct')
+         412±7μs         533±50μs     1.29  groupby.GroupByMethods.time_dtype_as_group('float', 'tail', 'direct')
+        300±10μs         366±10μs     1.22  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'direct')
+        375±10ms          455±6ms     1.21  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'transformation')
+         287±6μs         335±20μs     1.17  groupby.GroupByMethods.time_dtype_as_group('object', 'shift', 'direct')
+         252±2ms         290±10ms     1.15  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'direct')
+         414±3μs         469±10μs     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'tail', 'transformation')
+         398±9ms          450±8ms     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct')

@jbrockmendel

pep8speaks · 2019-11-20T19:08:31Z

Hello @WillAyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-21 04:54:58 UTC

WillAyd · 2019-11-20T19:09:10Z

pandas/core/groupby/ops.py

@@ -451,11 +451,7 @@ def _cython_operation(

        # categoricals are only 1d, so we
        # are not setup for dim transforming
-        if is_categorical_dtype(values) or is_sparse(values):


This was the "fix" for #29746 - not sure this is really the right way to do that but makes tests green for now

yah this definitely looks sketchy

WillAyd · 2019-11-20T19:18:11Z

Here are graphs for the cProfile output using both methods and the below code:

>>> import pandas as pd
>>> import numpy as np
>>> import cProfile
>>> grp = pd.DataFrame(np.ones((1, 100_00))).groupby([1])
>>> cProfile.run('grp.sum()', 'old_grpbystats')

jreback · 2019-11-20T19:19:58Z

We absolutely do not want to remove the blocks. Just make the code simpler / easier.

WillAyd · 2019-11-20T20:02:31Z

Yea making things simpler / easier is the point of this as well. Just wasn't sure if performance of frames mostly composed of numpy types that are wider than they are larger was a huge sticking point to optimize on

I'll check out some alternates; I still think this method isn't required and adds a lot of unnecessary complexity / exception catching. It might make more sense if we allow _iterate_slices in the DataFrameGroupBy class to return an Iterable of Blocks / Series objects instead of just the latter

jreback · 2019-11-20T20:31:20Z

i would lie to see as a first step moving this code to internals as an iterator
would make things a lot simpler

WillAyd · 2019-11-20T20:32:02Z

Cool makes sense. Thanks for the feedback

jbrockmendel · 2019-11-21T00:14:49Z

On the cProfile graphs are there any particular points of interest?

jbrockmendel · 2019-11-21T00:19:09Z

@WillAyd i need to pivot to work on arithmetic for a bit, but if you're interested in taking taking my part of the groupby baton, the functions libreduction.check_result_array and libreduction._extract_result have analogues in the non-fast methods, but they behave slightly differently. I think its worthwhile to get those in alignment or at least document why they are slightly different.

WillAyd · 2019-11-21T01:52:27Z

Cool yea will do @jbrockmendel

jbrockmendel · 2019-11-25T16:23:19Z

pandas/core/internals/managers.py

+            else:  # partial; specific to groupby
+                # TODO: func should only return one value; need to remove
+                # ohlc from groupby semantics to accomplish generically
+                result, _ = f(blk.values)  # better way?


better way?

blk.apply

jbrockmendel · 2019-11-25T16:23:35Z

are the asvs here worth salvaging?

WillAyd added 4 commits November 20, 2019 08:43

Removed block management

57f418e

updated test expectation for float -> int inference

615351f

Removed raising on Categorical (???)

4460f81

Added benchmark for wide frame

124a135

WillAyd commented Nov 20, 2019

View reviewed changes

WillAyd added 2 commits November 20, 2019 11:10

Wide benchmarks

658ea0f

black

2eed553

WillAyd force-pushed the less-blocks-groupby branch from 4853bdd to 2eed553 Compare November 20, 2019 19:12

Merge remote-tracking branch 'upstream/master' into less-blocks-groupby

3823eb5

WillAyd mentioned this pull request Nov 21, 2019

GroupBy Doesn't Always Maintain Column Index Name #29764

Closed

WillAyd added 4 commits November 20, 2019 20:50

touching internals

d04a089

Merge branch 'blocks-apply' into less-blocks-groupby

e7c593a

Reverted categorical test for now

3e33072

reverted categorical file for now

553c961

WillAyd closed this Nov 23, 2019

jbrockmendel reviewed Nov 25, 2019

View reviewed changes

WillAyd deleted the less-blocks-groupby branch January 16, 2020 00:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Less blocks in groupby #29753

Less blocks in groupby #29753

WillAyd commented Nov 20, 2019

pep8speaks commented Nov 20, 2019 •

edited

Loading

WillAyd Nov 20, 2019

jbrockmendel Nov 20, 2019

WillAyd commented Nov 20, 2019

jreback commented Nov 20, 2019

WillAyd commented Nov 20, 2019

jreback commented Nov 20, 2019

WillAyd commented Nov 20, 2019

jbrockmendel commented Nov 21, 2019

jbrockmendel commented Nov 21, 2019

WillAyd commented Nov 21, 2019

jbrockmendel Nov 25, 2019

jbrockmendel commented Nov 25, 2019

Less blocks in groupby #29753

Less blocks in groupby #29753

Conversation

WillAyd commented Nov 20, 2019

pep8speaks commented Nov 20, 2019 • edited Loading

Comment last updated at 2019-11-21 04:54:58 UTC

WillAyd Nov 20, 2019

Choose a reason for hiding this comment

jbrockmendel Nov 20, 2019

Choose a reason for hiding this comment

WillAyd commented Nov 20, 2019

jreback commented Nov 20, 2019

WillAyd commented Nov 20, 2019

jreback commented Nov 20, 2019

WillAyd commented Nov 20, 2019

jbrockmendel commented Nov 21, 2019

jbrockmendel commented Nov 21, 2019

WillAyd commented Nov 21, 2019

jbrockmendel Nov 25, 2019

Choose a reason for hiding this comment

better way?

jbrockmendel commented Nov 25, 2019

pep8speaks commented Nov 20, 2019 •

edited

Loading