Performance of pandas.algos.groupby_int64 #14293

Closed · mrocklin opened this issue Sep 23, 2016 · 11 comments

Labels: Groupby, Performance (Memory or execution speed performance)
Milestone: 0.19.0

Comments

@mrocklin (Contributor) opened this issue Sep 23, 2016

For dask.dataframe shuffle operations (groupby.apply, merge), when running with multiple threads per process, I sometimes find my computations dominated by pandas.algos.groupby_int64. Looking at the source code for this it looks like it's using dynamic pure python objects from Cython. I'm curious if there are ways to accelerate this function, particularly in multi-threaded situations (releasing the GIL).

One solution that comes to mind would be to do a single pass over labels, pre-compute the length of each members list in results and then pre-allocate these as arrays. This might allow better GIL-releasing behavior.
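
A rough two-pass sketch of that idea in plain NumPy/Python (illustrative only, not the pandas Cython code; the point is that the second loop touches only C-level arrays, so a Cython version could run it under nogil):

import numpy as np

def group_indices_two_pass(labels):
    # Pass 1: count how many positions fall into each group. Labels are assumed
    # to be int64 codes with -1 meaning NaN/missing, as in the pandas grouper.
    labels = np.asarray(labels, dtype=np.int64)
    valid = labels >= 0
    counts = np.bincount(labels[valid])

    # Pre-allocate one int64 array per group instead of growing Python lists.
    result = [np.empty(c, dtype=np.int64) for c in counts]
    filled = np.zeros(len(counts), dtype=np.int64)

    # Pass 2: scatter each position into its group's pre-allocated array.
    for i, k in enumerate(labels):
        if k == -1:
            continue
        result[k][filled[k]] = i
        filled[k] += 1
    return result

labels = np.array([0, 2, 1, 0, 2, -1, 1])
print(group_indices_two_pass(labels))
# [array([0, 3]), array([2, 6]), array([1, 4])]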

Thoughts?

@jreback (Contributor) commented Sep 23, 2016

Can you show a particular example that you have been timing (so we're all on the same page)?

@wesm (Member) commented Sep 24, 2016

Those groupby_* functions are a real bummer.

The right way to do this, which avoids the GIL (assuming that labels does not contain Python objects):

  • Specialize on the labels array type
  • Use a native hash table rather than a Python dict
  • Do not use Python lists
  • Compute the categories, then do an O(n) stable counting sort (depending on the cardinality; if it is large, use a merge sort instead), after which you can do the groupby.apply take operations using the counting-sort array.

For example, if the labels are:

[a, b, c, a, b, c, a, b, c]

Then you factorize (which will cause GIL contention if object dtype) to get

[0, 1, 2, 0, 1, 2, 0, 1, 2]

Stable argsort (either by counting sort or mergesort) yields

[0, 3, 6, 1, 4, 7, 2, 5, 8]

Now, you iterate through this array to delimit each contiguous group.

  • positions 0, 3, 6 all carry label code 0
  • when you reach index 3 of the argsort array, the label code changes to 1, so you know you have hit the end of the first group
  • and so on
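
A small NumPy/pandas sketch of that recipe (the names here are illustrative, not pandas internals):

import numpy as np
import pandas as pd

labels = np.array(list("abcabcabc"))

# Factorize: map each label to a dense integer code.
codes, uniques = pd.factorize(labels)        # codes == [0, 1, 2, 0, 1, 2, 0, 1, 2]

# Stable argsort (mergesort here; a counting sort would be O(n)).
order = np.argsort(codes, kind="mergesort")  # == [0, 3, 6, 1, 4, 7, 2, 5, 8]

# One pass over the sorted codes delimits each contiguous group.
sorted_codes = codes[order]
cuts = np.flatnonzero(np.diff(sorted_codes)) + 1
groups = {
    uniques[sorted_codes[start]]: order[start:stop]
    for start, stop in zip(np.r_[0, cuts], np.r_[cuts, len(order)])
}
# groups == {'a': [0, 3, 6], 'b': [1, 4, 7], 'c': [2, 5, 8]}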

@wesm (Member) commented Sep 24, 2016

Note that in pandas 2.0, factorizing strings (assuming we implement https://pydata.github.io/pandas-design/strings.html) will not require the GIL -> multicore happiness.

@mrocklin (Contributor, Author)

@jreback

import numpy as np
import pandas as pd
s = pd.Series(np.random.randint(0, 100, size=2**24))
s.groupby(s).groups
         412 function calls (408 primitive calls) in 5.913 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    5.120    5.120    5.120    5.120 {pandas.algos.groupby_int64}
        1    0.792    0.792    5.913    5.913 <string>:1(<module>)
        1    0.000    0.000    5.120    5.120 base.py:2355(groupby)

@mrocklin (Contributor, Author)

@wesm FWIW I only care about groupby_int*. For me the index here is the eventual partition number, so labels are always in 0..n. Obviously you don't want to assume this for general pandas groupby algorithms, but it might be useful to pull the argsort-partition logic out into a separate function.
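
For that restricted case, where the labels are already dense integer codes in [0, n), a sketch of the argsort-partition step on its own might look like this (a hypothetical helper, not an existing pandas function):

import numpy as np

def partition_positions(labels, n_groups):
    # Labels are already dense codes in [0, n_groups), so no factorize/hashing is needed.
    labels = np.asarray(labels, dtype=np.int64)
    counts = np.bincount(labels, minlength=n_groups)
    order = np.argsort(labels, kind="mergesort")  # stable, O(n log n); a counting sort is O(n)
    stops = np.cumsum(counts)
    starts = stops - counts
    return [order[a:b] for a, b in zip(starts, stops)]

labels = np.array([2, 0, 1, 2, 0, 1])
print(partition_positions(labels, 3))
# [array([1, 4]), array([2, 5]), array([0, 3])]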

@jreback (Contributor) commented Sep 24, 2016

master

In [1]: np.random.seed(1234)
   ...: 
   ...: s = pd.Series(np.random.randint(0, 100, size=2**24))
   ...: 

In [2]: %timeit -n 1 -r 1 s.groupby(s).groups
1 loop, best of 1: 5.26 s per loop

In [3]: s.groupby(s).groups[0][-10:]
Out[3]: 
[16776669,
 16776672,
 16776713,
 16776752,
 16776875,
 16777047,
 16777110,
 16777131,
 16777139,
 16777165]

In [4]: pd.__version__
Out[4]: '0.19.0rc1+31.gd9e51fe'

patch

In [1]: np.random.seed(1234)
   ...: 
   ...: s = pd.Series(np.random.randint(0, 100, size=2**24))
   ...: 

In [2]: %timeit -n 1 -r 1 s.groupby(s).groups
1 loop, best of 1: 742 ms per loop

In [3]: s.groupby(s).groups[0][-10:]
Out[3]: 
array([16776669, 16776672, 16776713, 16776752, 16776875, 16777047,
       16777110, 16777131, 16777139, 16777165])

In [4]: pd.__version__
Out[4]: '0.19.0rc1+32.g6256f2f'
[jreback-~/pandas] git diff master
diff --git a/pandas/algos.pyx b/pandas/algos.pyx
index 8710ef3..0a1e806 100644
--- a/pandas/algos.pyx
+++ b/pandas/algos.pyx
@@ -1024,16 +1024,17 @@ def groupby_indices(dict ids, ndarray[int64_t] labels,
         result[ids[i]] = arr
         vecs[i] = <int64_t*> arr.data

-    for i from 0 <= i < n:
-        k = labels[i]
+    with nogil:
+        for i from 0 <= i < n:
+            k = labels[i]

-        # was NaN
-        if k == -1:
-            continue
+            # was NaN
+            if k == -1:
+                continue

-        loc = seen[k]
-        vecs[k][loc] = i
-        seen[k] = loc + 1
+            loc = seen[k]
+            vecs[k][loc] = i
+            seen[k] = loc + 1

     free(vecs)
     return result
diff --git a/pandas/indexes/base.py b/pandas/indexes/base.py
index f430305..16fe13c 100644
--- a/pandas/indexes/base.py
+++ b/pandas/indexes/base.py
@@ -2366,7 +2366,11 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
         groups : dict
             {group name -> group labels}
         """
-        return self._groupby(self.values, _values_from_object(to_groupby))
+        from pandas import Categorical
+        from pandas.core.groupby import _groupby_indices
+        result = _groupby_indices(Categorical(to_groupby))
+
+        return result

     def map(self, mapper):
         """

@jreback (Contributor) commented Sep 24, 2016

I had added a special case for categorical grouping because merge_asof needed it (it makes things a lot faster). So just converting to categoricals and using that path here is a pretty good speedup.

A couple of minor tests fail with this (an edge case with all-NaN categories), but that's no big deal to fix.

This still hits a dictionary encoding path (in Cython), but the GIL could be released for part of it.
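
Roughly, the categorical route reduces the problem to the dense-code case sketched earlier (again illustrative, not the actual patched code path):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 100, size=2**24))

# Categorical conversion yields dense integer codes plus a small categories index;
# the heavy per-row grouping work then runs on plain int64 codes, which is the part
# a Cython implementation can do without the GIL.
cat = pd.Categorical(s)
codes = np.asarray(cat.codes, dtype=np.int64)   # labels in [0, n_categories)
# ...then partition positions by code exactly as in the stable-sort sketch above.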

@jreback (Contributor) commented Sep 24, 2016

You should get GIL releasing in the factorize and now in groupby_indices (these are about 2/3 of the time); the rest is Python-ish.

@mrocklin (Contributor, Author)

That performance gain would definitely resolve my immediate needs and likely move Pandas well away from being a bottleneck.

@jreback (Contributor) commented Sep 24, 2016

I pushed it up: https://github.com/jreback/pandas/tree/groupby

(As I said, I'm still running some perf numbers and checking a couple of edge cases, but give it a go.)

@jreback (Contributor) commented Sep 24, 2016

  [d9e51fe7] [3da4a8d7]
+  610.20μs     2.54ms      4.17  groupby.groupby_ngroups_float_100.time_sum
+    2.91ms    11.70ms      4.02  groupby.groupby_ngroups_float_10000.time_count
+   12.49ms    45.73ms      3.66  groupby.groupby_ngroups_float_100.time_unique
+    1.50ms     5.24ms      3.50  groupby.groupby_ngroups_float_100.time_tail
+  484.40ms      1.48s      3.06  groupby.groupby_multi_index.time_groupby_multi_index
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback added the Groupby and Performance (Memory or execution speed performance) labels Sep 24, 2016
jreback added this to the 0.19.0 milestone Sep 24, 2016
jreback added a commit to jreback/pandas that referenced this issue Sep 27, 2016
remove pandas.core.groupby._groupby_indices to use algos.groupsort_indexer
add Categorical._reverse_indexer to facilitate

closes pandas-dev#14293