PERF: faster grouping #14294

jreback · 2016-09-24T13:53:05Z

closes #14293

jreback · 2016-09-24T13:53:22Z

cc @mrocklin @wesm

mrocklin · 2016-09-24T14:28:10Z

Does the performance regression pointed out in #14293 (comment) still hold here?

jreback · 2016-09-24T15:43:11Z

so that was a bug, now fixed.

this is size=2**21
large is 10k groups, small is 100

· Running 8 total benchmarks (2 commits * 1 environments * 4 benchmarks)
[  0.00%] · For pandas commit hash 45b79968:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_large                                                                                                                       279.51ms
[ 25.00%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_small                                                                                                                       104.08ms
[ 37.50%] ··· Running groupby.groupby_groups.time_groupby_groups_object_large                                                                                                                      374.57ms
[ 50.00%] ··· Running groupby.groupby_groups.time_groupby_groups_object_small                                                                                                                      151.01ms
[ 50.00%] · For pandas commit hash d9e51fe7:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_large                                                                                                                       856.54ms
[ 75.00%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_small                                                                                                                       674.41ms
[ 87.50%] ··· Running groupby.groupby_groups.time_groupby_groups_object_large                                                                                                                      392.38ms
[100.00%] ··· Running groupby.groupby_groups.time_groupby_groups_object_small                                                                                                                      262.49ms    

   before     after       ratio
  [d9e51fe7] [45b79968]
-  856.54ms   279.51ms      0.33  groupby.groupby_groups.time_groupby_groups_int64_large
-  674.41ms   104.08ms      0.15  groupby.groupby_groups.time_groupby_groups_int64_small
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

These are also faster for object dtypes (about 2x for _small), though I think the margin isn't as big as it should be.
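For anyone without the asv suite at hand, the benchmark shape described above can be approximated roughly as follows. This is only a sketch, not the actual asv benchmark code: the helper names `make_frame` and `time_groups`, the column names, the string labels used for the object case, and the reduced size are all assumptions made for illustration.

```python
import time

import numpy as np
import pandas as pd


def make_frame(size, ngroups, dtype="int64", seed=0):
    """Build a frame with `size` rows whose 'key' column has at most `ngroups` distinct values."""
    rng = np.random.RandomState(seed)
    keys = rng.randint(0, ngroups, size=size)
    if dtype == "object":
        # Hypothetical string labels standing in for the object-dtype case.
        keys = np.array(["g%d" % k for k in keys], dtype=object)
    return pd.DataFrame({"key": keys, "value": rng.randn(size)})


def time_groups(df):
    """Time one evaluation of .groups, the attribute these benchmarks exercise."""
    start = time.perf_counter()
    groups = df.groupby("key").groups
    return time.perf_counter() - start, groups


# The thread uses size=2**21; a smaller size keeps this sketch quick to run.
size = 2**16
for label, ngroups in [("small", 100), ("large", 10000)]:
    for dtype in ("int64", "object"):
        elapsed, groups = time_groups(make_frame(size, ngroups, dtype))
        print("%s_%s: %d groups in %.1f ms" % (dtype, label, len(groups), elapsed * 1e3))
```

Absolute timings from a one-shot timer like this are noisy and will not match the asv numbers; only the relative before/after comparison on the same machine is meaningful.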

codecov-io · 2016-09-24T16:07:20Z

Current coverage is 85.26% (diff: 100%)

Merging #14294 into master will decrease coverage by <.01%

@@             master     #14294   diff @@
==========================================
  Files           140        140          
  Lines         50593      50599     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43137      43142     +5   
- Misses         7456       7457     +1   
  Partials          0          0

Powered by Codecov. Last update b81d444...82d19dd

jreback · 2016-09-24T23:16:01Z

these are size=2**22, with 100 and 10000 groups for small/large

   before     after       ratio
  [d9e51fe7] [87db6a4f]
-  136.59ms   107.11ms      0.78  groupby.groupby_indices.time_groupby_indices
-  996.17ms   699.80ms      0.70  groupby.groupby_groups.time_groupby_groups_object_large
-  606.62ms   319.25ms      0.53  groupby.groupby_groups.time_groupby_groups_object_small
-     1.80s   540.05ms      0.30  groupby.groupby_groups.time_groupby_groups_int64_large
-     1.49s   207.64ms      0.14  groupby.groupby_groups.time_groupby_groups_int64_small
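As a quick sanity check on the table above, the ratio column is simply after divided by before. A small sketch (the helper name `ratio` is made up here, and the values printed as 1.80s and 1.49s are already rounded, so those rows are approximate):

```python
def ratio(before_ms, after_ms):
    """Return after/before rounded to two decimals, like the ratio column."""
    return round(after_ms / before_ms, 2)


rows = [
    ("time_groupby_indices", 136.59, 107.11),
    ("time_groupby_groups_object_large", 996.17, 699.80),
    ("time_groupby_groups_object_small", 606.62, 319.25),
    ("time_groupby_groups_int64_large", 1800.0, 540.05),  # 1.80s as printed
    ("time_groupby_groups_int64_small", 1490.0, 207.64),  # 1.49s as printed
]
for name, before, after in rows:
    r = ratio(before, after)
    print("%-36s ratio=%.2f speedup=%.1fx" % (name, r, before / after))
```

A ratio below 1 means the "after" commit is faster; the speedup is the inverse of the ratio.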

jreback · 2016-09-26T10:46:51Z

@wesm @jorisvandenbossche

any comments?

chris-b1 · 2016-09-26T15:44:49Z

pandas/core/groupby.py

+    return a dictionary of groups -> indices
+    """
+    if not is_categorical_dtype(values):
+        values = Categorical(values)


I might be missing something, but wouldn't it be possible to re-use the possibly existing factorization here?

def _groupby_indices(codes : Grouping.labels, cats : Grouping.group_index)

we are re-using the existing factorization (if it's categorical already, it's unchanged).

I meant in the non-Categorical case: if Grouping.labels is already populated, could we skip factorizing again in the Categorical?

actually I see you proposed something else. .group_index is not defined except for a single Grouping. If we had a MultiGrouping, then yes I think that would work (right now that's basically a list of Groupings).

Oh, I see it now, I went one class too deep. I do think you could speed up the single grouping case here - https://github.com/jreback/pandas/blob/580237924022eb74575420ad4433952c8de318dd/pandas/core/groupby.py#L2345 - make it:

return self.index.groupby(Categorical.from_codes(self.labels, self.group_index))
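To illustrate the idea behind that suggestion with public API only: once codes and uniques exist from an earlier factorization, `Categorical.from_codes` can wrap them without factorizing the raw values a second time. A minimal sketch, where `codes` and `uniques` merely stand in for `Grouping.labels` and `Grouping.group_index` (that correspondence is an assumption for illustration, not the internal code path):

```python
import numpy as np
import pandas as pd

values = np.array(["b", "a", "b", "c", "a"], dtype=object)

# One factorization pass: integer codes plus the unique labels.
codes, uniques = pd.factorize(values, sort=True)

# Wrap the existing codes; the raw values are not factorized again.
cat = pd.Categorical.from_codes(codes, categories=uniques)

# Building the Categorical from the raw values gives the same labels.
same_labels = list(cat) == list(pd.Categorical(values))

# Index.groupby maps each label to the index entries in that group.
index = pd.Index([10, 11, 12, 13, 14])
groups = {k: list(v) for k, v in index.groupby(cat).items()}
print(same_labels, groups)
```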

wesm · 2016-09-26T16:22:08Z

pandas/algos.pyx

-        seen[k] = loc + 1
+            loc = seen[k]
+            vecs[k][loc] = i
+            seen[k] = loc + 1


We already have groupsort_indexer, does this do anything different?

let me see...
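For context, a group-sort indexer of this kind is essentially a counting sort over the integer group codes. The pure-Python sketch below is not the Cython routine in pandas/algos.pyx; the function name `groupsort_indexer_sketch` is made up, and missing codes (-1) are deliberately not handled.

```python
import numpy as np


def groupsort_indexer_sketch(labels, ngroups):
    """Counting-sort indexer: row positions ordered by group code, plus group sizes.

    `labels` holds integer codes in [0, ngroups); -1 (missing) is not handled.
    """
    labels = np.asarray(labels, dtype=np.int64)

    # Pass 1: count how many rows fall in each group.
    counts = np.zeros(ngroups, dtype=np.int64)
    for k in labels:
        counts[k] += 1

    # Starting offset of each group in the output array.
    where = np.zeros(ngroups, dtype=np.int64)
    where[1:] = np.cumsum(counts)[:-1]

    # Pass 2: drop each row position into the next free slot of its group.
    indexer = np.empty(len(labels), dtype=np.int64)
    for i, k in enumerate(labels):
        indexer[where[k]] = i
        where[k] += 1
    return indexer, counts


labels = [1, 0, 2, 1, 0, 1]
indexer, counts = groupsort_indexer_sketch(labels, ngroups=3)
print(indexer, counts)
```

The second pass follows the same pattern as the `seen[k]` / `vecs[k][loc] = i` loop in the diff, except that all groups share one output array instead of one vector per group.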

wesm · 2016-09-26T16:23:06Z

pandas/indexes/base.py

+        result = _groupby_indices(to_groupby)
+
+        # map to the label
+        result = {k: self.take(v) for k, v in compat.iteritems(result)}


Are there benefits to doing this lazily?

this is not lazy, so not sure what you mean.

I'm just wondering whether producing a fully-boxed Index for each group up front has memory use or performance implications. This is something I guess we'll address in much more detail in pandas 2.0

AFAICT, this is only used on grouped.groups, so by definition we need to box. (and that's the ONLY benchmark that changed with this PR). I thought this would have broader implications (on the good side), but no-go.
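To make the eager-versus-lazy distinction concrete, here is a small sketch. The `positions` dict and the `iter_boxed` generator are hypothetical and not part of the PR; the eager form mirrors the dict comprehension in the diff, with `.items()` in place of `compat.iteritems`.

```python
import numpy as np
import pandas as pd

index = pd.Index(["r0", "r1", "r2", "r3", "r4"])

# Hypothetical positional result: group label -> integer positions.
positions = {"a": np.array([1, 4]), "b": np.array([0, 2]), "c": np.array([3])}

# Eager: one boxed Index object is built per group up front.
eager = {k: index.take(v) for k, v in positions.items()}


def iter_boxed(index, positions):
    """Lazy alternative: box a group's positions only when it is requested."""
    for k, v in positions.items():
        yield k, index.take(v)


first_label, first_index = next(iter_boxed(index, positions))
print(first_label, list(first_index), len(eager))
```

Since `.groups` has to hand back the complete mapping, the eager form is what that attribute needs; the lazy form would only pay off for callers that look at a few groups.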

jreback · 2016-09-26T22:57:47Z

roughly the same perf benefits here. with 5802379

   before     after       ratio
  [7dedbed8] [a731e2dd]
-  554.34ms   269.39ms      0.49  groupby.groupby_groups.time_groupby_groups_object_small
-     1.78s   430.09ms      0.24  groupby.groupby_groups.time_groupby_groups_int64_large
-     1.32s   180.49ms      0.14  groupby.groupby_groups.time_groupby_groups_int64_small
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer
add Categorical._reverse_indexer to facilitate
closes pandas-dev#14293
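For readers unfamiliar with the idea, a "reverse indexer" maps each category to the integer positions where it occurs. The sketch below uses only public pandas/NumPy calls and is not the private `Categorical._reverse_indexer` implementation; the function name `reverse_indexer_sketch` is made up, and missing values (code -1) are assumed absent.

```python
import numpy as np
import pandas as pd


def reverse_indexer_sketch(cat):
    """Map each category to the integer positions where it occurs.

    Assumes no missing values (code -1) for brevity.
    """
    codes = np.asarray(cat.codes, dtype=np.int64)
    ncat = len(cat.categories)

    # Stable sort keeps positions in their original order within each group.
    order = np.argsort(codes, kind="stable")
    counts = np.bincount(codes, minlength=ncat)

    # Split the sorted positions at the group boundaries.
    boundaries = np.cumsum(counts)[:-1]
    chunks = np.split(order, boundaries)
    return dict(zip(cat.categories, chunks))


cat = pd.Categorical(["b", "a", "b", "c", "a"])
result = {k: v.tolist() for k, v in reverse_indexer_sketch(cat).items()}
print(result)
```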

wesm · 2016-09-27T13:44:17Z

lgtm

jreback added the Groupby and Performance (Memory or execution speed performance) labels Sep 24, 2016

jreback added this to the 0.19.0 milestone Sep 24, 2016

jreback force-pushed the groupby branch from d9c7ce5 to 1f4e97f Compare September 24, 2016 13:56

jreback force-pushed the groupby branch from 1f4e97f to 45b7996 Compare September 24, 2016 15:10

jreback force-pushed the groupby branch 2 times, most recently from 5f82e6f to 1ba34fb Compare September 24, 2016 17:02

chris-b1 reviewed Sep 26, 2016

wesm reviewed Sep 26, 2016

jreback force-pushed the groupby branch from 87db6a4 to a731e2d Compare September 26, 2016 22:37

jreback force-pushed the groupby branch from a731e2d to 5802379 Compare September 26, 2016 23:05

PERF: faster grouping

82d19dd

remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer
add Categorical._reverse_indexer to facilitate
closes pandas-dev#14293

jreback force-pushed the groupby branch from 669d0dc to 82d19dd Compare September 27, 2016 10:39

jreback closed this in 71df09c Sep 27, 2016
