PERF: use StringHashTable for strings #14859

jreback · 2016-12-12T00:31:17Z

Provides a modest speedup for all string hashing. The key point is that it releases the GIL on more operations where this is possible (mainly factorize).

This can easily be extended to .value_counts() and .duplicated() (for strings); there is a new issue for that.
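For readers unfamiliar with what a hash table does during factorizing, here is a minimal pure-Python sketch. It is only a conceptual illustration, not the Cython StringHashTable touched by this PR: a plain dict stands in for the hash table, the function names are hypothetical, and missing values are ignored. It also shows why value counts and duplicate detection can reuse the same kind of single pass.

```python
def factorize_strings(values):
    """Return (codes, uniques); codes[i] is the position of values[i] in uniques.

    Conceptual sketch only: a dict stands in for the string hash table.
    """
    table = {}    # string -> integer code, assigned in first-seen order
    codes = []
    uniques = []
    for value in values:
        code = table.get(value)
        if code is None:
            # First time this string is seen: give it the next free code.
            code = len(uniques)
            table[value] = code
            uniques.append(value)
        codes.append(code)
    return codes, uniques


def value_counts(values):
    """Count occurrences of each string by tallying the factorized codes."""
    codes, uniques = factorize_strings(values)
    counts = [0] * len(uniques)
    for code in codes:
        counts[code] += 1
    return dict(zip(uniques, counts))


def duplicated(values):
    """Flag every occurrence of a string after its first one."""
    seen = set()
    flags = []
    for value in values:
        flags.append(value in seen)
        seen.add(value)
    return flags
```

For example, `factorize_strings(["b", "a", "b"])` assigns code 0 to "b" and code 1 to "a", so the codes are `[0, 1, 0]` and the uniques are `["b", "a"]`.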

In [9]: np.random.seed(1234)

In [10]: strings = tm.makeStringIndex(1000000)

In [11]: def f():
    ...:     for i in range(2):
    ...:         pd.factorize(strings)
    ...:         

In [12]: @tm.test_parallel(num_threads=2)
    ...: def g():
    ...:     pd.factorize(strings)
    ...:     

In [13]: %timeit f()
1 loop, best of 3: 685 ms per loop

In [14]: %timeit g()
1 loop, best of 3: 446 ms per loop

In [15]: strings = strings.take(np.random.randint(0,1000,size=len(strings)))

In [16]: strings.nunique()
Out[16]: 1000

In [17]: %timeit f()
1 loop, best of 3: 222 ms per loop

In [18]: %timeit g()
10 loops, best of 3: 190 ms per loop
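A note on reading these timings: f() factorizes twice in sequence, while g() factorizes once in each of two threads, so both do the same total work. g() can only beat f() if the threads actually overlap, that is, if the GIL is released during hashing. The sketch below shows roughly what a decorator like tm.test_parallel does, using only the standard library. The name `run_in_threads` is hypothetical and this is not the pandas implementation.

```python
import threading
from functools import wraps


def run_in_threads(num_threads=2):
    """Hypothetical stand-in for tm.test_parallel.

    Calls the decorated function once in each of num_threads threads
    and waits for all of them to finish.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            threads = [
                threading.Thread(target=func, args=args, kwargs=kwargs)
                for _ in range(num_threads)
            ]
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
        return wrapper
    return decorator
```

Decorating a function with `@run_in_threads(num_threads=2)` and timing it against a plain two-iteration loop reproduces the f() versus g() comparison above. If the function holds the GIL throughout, the threaded version should be no faster than the loop.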

jreback · 2016-12-12T00:32:50Z

cc @mrocklin

jreback · 2016-12-12T01:08:35Z

asv_bench/benchmarks/algorithms.py


        # match
        self.uniques = tm.makeStringIndex(1000).values
        self.all = self.uniques.repeat(10)

    def time_factorize_int(self):
+        self.strings.factorize()


should be factorize_string

jreback · 2016-12-12T01:09:06Z

doc/source/whatsnew/v0.20.0.txt

@@ -110,7 +110,7 @@ Removal of prior version deprecations/changes
 Performance Improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~

-
+- increased performance of ``pd.factorize()`` by releasing the GIL with ``object`` dtype when inferred as strings (:issue:``)




jreback · 2016-12-12T01:10:11Z

pandas/src/hashtable_class_helper.pxi.in

@@ -1,3 +1,5 @@
+# cython: profile=True
+


change back to False

jreback · 2016-12-12T01:11:19Z

pandas/src/hashtable_class_helper.pxi.in

@@ -94,6 +99,61 @@ cdef class {{name}}Vector:

 {{endfor}}

+cdef class StringVector:
+


not actually using this, so add a comment or delete it

codecov-io · 2016-12-12T06:13:25Z

Current coverage is 85.31% (diff: 100%)

No coverage report found for master at 033d345.

Powered by Codecov. Last update 033d345...ade23d1

jreback · 2016-12-14T16:05:22Z

cc @mrocklin

if you can give this a stress test (and see if it helps in your perf tests), when you have a chance

mrocklin · 2016-12-14T18:14:37Z

I *think* I've narrowed down my original problem to something else. Working on isolating it further now. Will try this after that issue is removed.

allows releasing the GIL on these dtypes xref pandas-dev#13745

xref pandas-dev#13745

provides a modest speedup for all string hashing. The key thing is, it will release the GIL on more operations where this is possible (mainly factorize).

can be easily extended to value_counts() and .duplicated() (for strings)

Author: Jeff Reback

Closes pandas-dev#14859 from jreback/string and squashes the following commits:

98f46c2 [Jeff Reback] PERF: use StringHashTable for strings in factorizing

jreback added the Performance and Strings labels Dec 12, 2016

jreback added this to the 0.20.0 milestone Dec 12, 2016

jreback mentioned this pull request Dec 12, 2016

PERF: use StringHashTable for value_counts / duplicated with strings #14860

Open

2 tasks

jreback force-pushed the string branch from 0102c3d to 0cb98a4 December 12, 2016 11:53

jreback mentioned this pull request Dec 12, 2016

ENH: merge_asof() has type specializations and can take multiple 'by' parameters (#13936) #14783

Closed

4 tasks

jreback force-pushed the string branch from 0cb98a4 to 979ecb3 December 14, 2016 13:36

jreback mentioned this pull request Dec 14, 2016

CLN: remove need for *VectorData c-structures in hashtable.pyx #14879

Closed

jreback force-pushed the string branch 2 times, most recently from a8a01d6 to ade23d1 December 15, 2016 11:34

PERF: use StringHashTable for strings in factorizing

98f46c2

allows releasing the GIL on these dtypes xref pandas-dev#13745

jreback force-pushed the string branch from ade23d1 to 98f46c2 December 15, 2016 23:07

jreback closed this in 3ba2cff Dec 15, 2016

chris-b1 mentioned this pull request Apr 19, 2017

BUG: memory leak in unique() with object dtype #16057

Closed
