Skip to content

PERF: faster grouping #14294

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

PERF: faster grouping #14294

wants to merge 1 commit into from

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Sep 24, 2016

closes #14293

@jreback jreback added Groupby Performance Memory or execution speed performance labels Sep 24, 2016
@jreback jreback added this to the 0.19.0 milestone Sep 24, 2016
@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2016

cc @mrocklin
@wesm

@mrocklin
Copy link
Contributor

Does the performance regression pointed out in #14293 (comment) still hold here?

@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2016

so that was a bug, now fixed.

this is size=2**21
large is 10k groups, small is 100

· Running 8 total benchmarks (2 commits * 1 environments * 4 benchmarks)
[  0.00%] · For pandas commit hash 45b79968:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_large                                                                                                                       279.51ms
[ 25.00%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_small                                                                                                                       104.08ms
[ 37.50%] ··· Running groupby.groupby_groups.time_groupby_groups_object_large                                                                                                                      374.57ms
[ 50.00%] ··· Running groupby.groupby_groups.time_groupby_groups_object_small                                                                                                                      151.01ms
[ 50.00%] · For pandas commit hash d9e51fe7:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_large                                                                                                                       856.54ms
[ 75.00%] ··· Running groupby.groupby_groups.time_groupby_groups_int64_small                                                                                                                       674.41ms
[ 87.50%] ··· Running groupby.groupby_groups.time_groupby_groups_object_large                                                                                                                      392.38ms
[100.00%] ··· Running groupby.groupby_groups.time_groupby_groups_object_small                                                                                                                      262.49ms    

   before     after       ratio
  [d9e51fe7] [45b79968]
-  856.54ms   279.51ms      0.33  groupby.groupby_groups.time_groupby_groups_int64_large
-  674.41ms   104.08ms      0.15  groupby.groupby_groups.time_groupby_groups_int64_small
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

These are faster for object dtypes (and about 2x for _small), though the margin isn't as big as it should be I think.

@codecov-io
Copy link

codecov-io commented Sep 24, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14294 into master will decrease coverage by <.01%

@@             master     #14294   diff @@
==========================================
  Files           140        140          
  Lines         50593      50599     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43137      43142     +5   
- Misses         7456       7457     +1   
  Partials          0          0          

Powered by Codecov. Last update b81d444...82d19dd

@jreback jreback force-pushed the groupby branch 2 times, most recently from 5f82e6f to 1ba34fb Compare September 24, 2016 17:02
@jreback
Copy link
Contributor Author

jreback commented Sep 24, 2016

these are 2*22 with 100, 10000 for small/large

   before     after       ratio
  [d9e51fe7] [87db6a4f]
-  136.59ms   107.11ms      0.78  groupby.groupby_indices.time_groupby_indices
-  996.17ms   699.80ms      0.70  groupby.groupby_groups.time_groupby_groups_object_large
-  606.62ms   319.25ms      0.53  groupby.groupby_groups.time_groupby_groups_object_small
-     1.80s   540.05ms      0.30  groupby.groupby_groups.time_groupby_groups_int64_large
-     1.49s   207.64ms      0.14  groupby.groupby_groups.time_groupby_groups_int64_small

@jreback
Copy link
Contributor Author

jreback commented Sep 26, 2016

@wesm
@jorisvandenbossche

any comments

return a dictionary of groups -> indices
"""
if not is_categorical_dtype(values):
values = Categorical(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, but wouldn't it possible to re-use the possibly existing factorization here?

def _groupby_indices(codes : Grouping.labels,  cats : Grouping.group_index)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are re-using the existing factorization (if its categorical already its unchanged).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant in the non Categorical case, if Grouping.labels is already populated, could skip factorizing again in the Categorical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I see you proposed something else. .group_index is not defined except for a single Grouping. If we had a MultiGrouping, then yes I think that would work (right now that basically a list of Groupings).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see it now, I went one class too deep. I do think you could speed up the single grouping case here - https://github.com/jreback/pandas/blob/580237924022eb74575420ad4433952c8de318dd/pandas/core/groupby.py#L2345 - make it:

return self.index.groupby(Categorical.from_codes(self.labels, self.group_index))

seen[k] = loc + 1
loc = seen[k]
vecs[k][loc] = i
seen[k] = loc + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have groupsort_indexer, does this do anything different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me see...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

result = _groupby_indices(to_groupby)

# map to the label
result = {k: self.take(v) for k, v in compat.iteritems(result)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there benefits to doing this lazily?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not lazy, so not sure what you mean.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering whether producing a fully-boxed Index for each group up front has memory use or performance implications. This is something I guess we'll address in much more detail in pandas 2.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, this is only used on grouped.groups, so by-definition we need to box. (and that's the ONLY benchmark that changed with this PR). I though this would have broader implications (on the good side), but no-go.

@jreback
Copy link
Contributor Author

jreback commented Sep 26, 2016

roughly the same perf benefits here. with 5802379

  [7dedbed8] [a731e2dd]
-  554.34ms   269.39ms      0.49  groupby.groupby_groups.time_groupby_groups_object_small
-     1.78s   430.09ms      0.24  groupby.groupby_groups.time_groupby_groups_int64_large
-     1.32s   180.49ms      0.14  groupby.groupby_groups.time_groupby_groups_int64_small
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer
add Categorical._reverse_indexer to facilitate

closes pandas-dev#14293
@wesm
Copy link
Member

wesm commented Sep 27, 2016

lgtm

@jreback jreback closed this in 71df09c Sep 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance of pandas.algos.groupby_int64
5 participants