Performance of pandas.algos.groupby_int64 #14293

Closed · mrocklin opened this issue Sep 23, 2016 · 11 comments

Labels: Groupby, Performance (Memory or execution speed performance)
Milestone: 0.19.0

Comments

@mrocklin (Contributor) opened this issue Sep 23, 2016

For dask.dataframe shuffle operations (groupby.apply, merge), when running with multiple threads per process, I sometimes find my computations dominated by pandas.algos.groupby_int64. Looking at the source code for this it looks like it's using dynamic pure python objects from Cython. I'm curious if there are ways to accelerate this function, particularly in multi-threaded situations (releasing the GIL).

One solution that comes to mind would be to do a single pass over labels, pre-compute the length of each members list in results and then pre-allocate these as arrays. This might allow better GIL-releasing behavior.
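
A rough two-pass sketch of that idea in plain NumPy/Python (illustrative only, not the pandas Cython code; the point is that the second loop touches only C-level arrays, so a Cython version could run it under nogil):

import numpy as np

def group_indices_two_pass(labels):
    # Pass 1: count how many positions fall into each group. Labels are assumed
    # to be int64 codes with -1 meaning NaN/missing, as in the pandas grouper.
    labels = np.asarray(labels, dtype=np.int64)
    valid = labels >= 0
    counts = np.bincount(labels[valid])

    # Pre-allocate one int64 array per group instead of growing Python lists.
    result = [np.empty(c, dtype=np.int64) for c in counts]
    filled = np.zeros(len(counts), dtype=np.int64)

    # Pass 2: scatter each position into its group's pre-allocated array.
    for i, k in enumerate(labels):
        if k == -1:
            continue
        result[k][filled[k]] = i
        filled[k] += 1
    return result

labels = np.array([0, 2, 1, 0, 2, -1, 1])
print(group_indices_two_pass(labels))
# [array([0, 3]), array([2, 6]), array([1, 4])]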

Thoughts?

@jreback (Contributor) commented Sep 23, 2016

Can you show a particular example that you have been timing (so we're all on the same page)?

@wesm (Member) commented Sep 24, 2016

Those groupby_* functions are a real bummer.

The right way to do this, which avoids the GIL (assuming that labels does not contain Python objects):

  • Specialize on the labels array type
  • Use a native hash table rather than a Python dict
  • Do not use Python lists
  • Compute the categories, then do an O(n) stable counting sort (depending on the cardinality; if it is large, use a merge sort instead), after which you can do the groupby.apply take operations using the counting-sort array.

For example, if the labels are:

[a, b, c, a, b, c, a, b, c]

Then you factorize (which will cause GIL contention if object dtype) to get

[0, 1, 2, 0, 1, 2, 0, 1, 2]

Stable argsort (either by counting sort or mergesort) yields

[0, 3, 6, 1, 4, 7, 2, 5, 8]

Now, you iterate through this array to delimit each contiguous group.

  • positions 0, 3, 6 all carry label code 0
  • when you reach index 3 of the argsort array, the label code changes to 1, so you know you have hit the end of the first group
  • and so on
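
A small NumPy/pandas sketch of that recipe (the names here are illustrative, not pandas internals):

import numpy as np
import pandas as pd

labels = np.array(list("abcabcabc"))

# Factorize: map each label to a dense integer code.
codes, uniques = pd.factorize(labels)        # codes == [0, 1, 2, 0, 1, 2, 0, 1, 2]

# Stable argsort (mergesort here; a counting sort would be O(n)).
order = np.argsort(codes, kind="mergesort")  # == [0, 3, 6, 1, 4, 7, 2, 5, 8]

# One pass over the sorted codes delimits each contiguous group.
sorted_codes = codes[order]
cuts = np.flatnonzero(np.diff(sorted_codes)) + 1
groups = {
    uniques[sorted_codes[start]]: order[start:stop]
    for start, stop in zip(np.r_[0, cuts], np.r_[cuts, len(order)])
}
# groups == {'a': [0, 3, 6], 'b': [1, 4, 7], 'c': [2, 5, 8]}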

@wesm (Member) commented Sep 24, 2016

Note that in pandas 2.0, factorizing strings (assuming we implement https://pydata.github.io/pandas-design/strings.html) will not require the GIL -> multicore happiness.

@mrocklin (Contributor, Author)

@jreback

import numpy as np
import pandas as pd
s = pd.Series(np.random.randint(0, 100, size=2**24))
s.groupby(s).groups
         412 function calls (408 primitive calls) in 5.913 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    5.120    5.120    5.120    5.120 {pandas.algos.groupby_int64}
        1    0.792    0.792    5.913    5.913 <string>:1(<module>)
        1    0.000    0.000    5.120    5.120 base.py:2355(groupby)

@mrocklin (Contributor, Author)

@wesm FWIW I only care about groupby_int*. For me the index here is the eventual partition number, so labels are always in 0..n. Obviously you don't want to assume this for general pandas groupby algorithms, but it might be useful to pull the argsort-partition logic out into a separate function.
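
For that restricted case, where the labels are already dense integer codes in [0, n), a sketch of the argsort-partition step on its own might look like this (a hypothetical helper, not an existing pandas function):

import numpy as np

def partition_positions(labels, n_groups):
    # Labels are already dense codes in [0, n_groups), so no factorize/hashing is needed.
    labels = np.asarray(labels, dtype=np.int64)
    counts = np.bincount(labels, minlength=n_groups)
    order = np.argsort(labels, kind="mergesort")  # stable, O(n log n); a counting sort is O(n)
    stops = np.cumsum(counts)
    starts = stops - counts
    return [order[a:b] for a, b in zip(starts, stops)]

labels = np.array([2, 0, 1, 2, 0, 1])
print(partition_positions(labels, 3))
# [array([1, 4]), array([2, 5]), array([0, 3])]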

@jreback (Contributor) commented Sep 24, 2016

master

In [1]: np.random.seed(1234)
   ...: 
   ...: s = pd.Series(np.random.randint(0, 100, size=2**24))
   ...: 

In [2]: %timeit -n 1 -r 1 s.groupby(s).groups
1 loop, best of 1: 5.26 s per loop

In [3]: s.groupby(s).groups[0][-10:]
Out[3]: 
[16776669,
 16776672,
 16776713,
 16776752,
 16776875,
 16777047,
 16777110,
 16777131,
 16777139,
 16777165]

In [4]: pd.__version__
Out[4]: '0.19.0rc1+31.gd9e51fe'

patch

In [1]: np.random.seed(1234)
   ...: 
   ...: s = pd.Series(np.random.randint(0, 100, size=2**24))
   ...: 

In [2]: %timeit -n 1 -r 1 s.groupby(s).groups
1 loop, best of 1: 742 ms per loop

In [3]: s.groupby(s).groups[0][-10:]
Out[3]: 
array([16776669, 16776672, 16776713, 16776752, 16776875, 16777047,
       16777110, 16777131, 16777139, 16777165])

In [4]: pd.__version__
Out[4]: '0.19.0rc1+32.g6256f2f'
[jreback-~/pandas] git diff master
diff --git a/pandas/algos.pyx b/pandas/algos.pyx
index 8710ef3..0a1e806 100644
--- a/pandas/algos.pyx
+++ b/pandas/algos.pyx
@@ -1024,16 +1024,17 @@ def groupby_indices(dict ids, ndarray[int64_t] labels,
         result[ids[i]] = arr
         vecs[i] = <int64_t*> arr.data

-    for i from 0 <= i < n:
-        k = labels[i]
+    with nogil:
+        for i from 0 <= i < n:
+            k = labels[i]

-        # was NaN
-        if k == -1:
-            continue
+            # was NaN
+            if k == -1:
+                continue

-        loc = seen[k]
-        vecs[k][loc] = i
-        seen[k] = loc + 1
+            loc = seen[k]
+            vecs[k][loc] = i
+            seen[k] = loc + 1

     free(vecs)
     return result
diff --git a/pandas/indexes/base.py b/pandas/indexes/base.py
index f430305..16fe13c 100644
--- a/pandas/indexes/base.py
+++ b/pandas/indexes/base.py
@@ -2366,7 +2366,11 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
         groups : dict
             {group name -> group labels}
         """
-        return self._groupby(self.values, _values_from_object(to_groupby))
+        from pandas import Categorical
+        from pandas.core.groupby import _groupby_indices
+        result = _groupby_indices(Categorical(to_groupby))
+
+        return result

     def map(self, mapper):
         """

@jreback (Contributor) commented Sep 24, 2016

I had added a special case for categorical grouping because merge_asof needed it (it makes things a lot faster). So just converting to categoricals and using that path here is a pretty good speedup.

A couple of minor tests fail with this (an edge case with all-NaN categories), but that's no big deal to fix.

This still hits a dictionary encoding path (in Cython), but the GIL could be released for part of it.
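
Roughly, the categorical route reduces the problem to the dense-code case sketched earlier (again illustrative, not the actual patched code path):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 100, size=2**24))

# Categorical conversion yields dense integer codes plus a small categories index;
# the heavy per-row grouping work then runs on plain int64 codes, which is the part
# a Cython implementation can do without the GIL.
cat = pd.Categorical(s)
codes = np.asarray(cat.codes, dtype=np.int64)   # labels in [0, n_categories)
# ...then partition positions by code exactly as in the stable-sort sketch above.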

@jreback (Contributor) commented Sep 24, 2016

You should get GIL releasing in the factorize and now in groupby_indices (these are about 2/3 of the time); the rest is Python-ish.

@mrocklin (Contributor, Author)

That performance gain would definitely resolve my immediate needs and likely move Pandas well away from being a bottleneck.

@jreback (Contributor) commented Sep 24, 2016

I pushed it up: https://github.com/jreback/pandas/tree/groupby

(As I said, I'm still running some perf numbers and checking a couple of edge cases, but give it a go.)

@jreback (Contributor) commented Sep 24, 2016

  [d9e51fe7] [3da4a8d7]
+  610.20μs     2.54ms      4.17  groupby.groupby_ngroups_float_100.time_sum
+    2.91ms    11.70ms      4.02  groupby.groupby_ngroups_float_10000.time_count
+   12.49ms    45.73ms      3.66  groupby.groupby_ngroups_float_100.time_unique
+    1.50ms     5.24ms      3.50  groupby.groupby_ngroups_float_100.time_tail
+  484.40ms      1.48s      3.06  groupby.groupby_multi_index.time_groupby_multi_index
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback added the Groupby and Performance (Memory or execution speed performance) labels Sep 24, 2016
jreback added this to the 0.19.0 milestone Sep 24, 2016
jreback added a commit to jreback/pandas that referenced this issue Sep 27, 2016
remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer
add Categorical._reverse_indexer to facilitate

closes pandas-dev#14293