PERF: use StringHashTable for strings #14859

Closed

jreback (Contributor) commented Dec 12, 2016

xref #13745

Provides a modest speedup for all string hashing. More importantly, it releases the GIL on more operations where this is possible (mainly factorize).

This can easily be extended to .value_counts() and .duplicated() (for strings); that belongs in a new issue.
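For readers unfamiliar with the three operations named above, here is a minimal sketch of what each returns on object-dtype string data. It uses only the public pandas API; the sample values are made up for illustration, and nothing here reflects the hash table internals this PR changes.

```python
import pandas as pd

# Illustrative data only: a small object-dtype Series of strings.
s = pd.Series(["a", "b", "a", "c", "b", "a"], dtype=object)

# factorize: integer codes plus the distinct values in order of first appearance.
codes, uniques = pd.factorize(s)
print(list(codes))    # [0, 1, 0, 2, 1, 0]
print(list(uniques))  # ['a', 'b', 'c']

# value_counts: number of occurrences of each distinct string.
counts = s.value_counts()
print(counts["a"], counts["b"], counts["c"])  # 3 2 1

# duplicated: True for every repeat of an earlier value.
dupes = s.duplicated()
print(dupes.tolist())  # [False, False, True, False, True, True]
```

All three hash the input strings, which is why a faster string hash table could plausibly benefit each of them.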

In [9]: np.random.seed(1234)

In [10]: strings = tm.makeStringIndex(1000000)

In [11]: def f():
    ...:     for i in range(2):
    ...:         pd.factorize(strings)
    ...:         

In [12]: @tm.test_parallel(num_threads=2)
    ...: def g():
    ...:     pd.factorize(strings)
    ...:     

In [13]: %timeit f()
1 loop, best of 3: 685 ms per loop

In [14]: %timeit g()
1 loop, best of 3: 446 ms per loop

In [15]: strings = strings.take(np.random.randint(0,1000,size=len(strings)))

In [16]: strings.nunique()
Out[16]: 1000

In [17]: %timeit f()
1 loop, best of 3: 222 ms per loop

In [18]: %timeit g()
10 loops, best of 3: 190 ms per loop
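The session above relies on `tm.makeStringIndex` and `tm.test_parallel`, which are pandas testing helpers and may not be available in current releases. The following is a rough sketch of the same serial-versus-threaded comparison using only public APIs and the standard library. The helper names (`make_strings`, `factorize_serial`, `factorize_threaded`, `timed`) are hypothetical, the data size is reduced, and any speedup depends on the pandas version and hardware, so no timings are claimed.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def make_strings(n, n_unique, seed=1234):
    """Build an object-dtype array of n strings drawn from n_unique distinct values."""
    rng = np.random.RandomState(seed)
    pool = np.array(["str_%07d" % i for i in range(n_unique)], dtype=object)
    return pool[rng.randint(0, n_unique, size=n)]


def factorize_serial(values, repeats=2):
    """Factorize the same array `repeats` times, one call after another."""
    return [pd.factorize(values) for _ in range(repeats)]


def factorize_threaded(values, num_threads=2):
    """Factorize the same array once per thread, all threads running concurrently."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(pd.factorize, [values] * num_threads))


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


strings = make_strings(100_000, 1_000)
_, t_serial = timed(factorize_serial, strings)
_, t_threaded = timed(factorize_threaded, strings)
print("serial:   %.3f s" % t_serial)
print("threaded: %.3f s" % t_threaded)
```

If the factorize hot loop holds the GIL, the threaded run should take about as long as the serial one; if the GIL is released, the two calls can overlap and the threaded run may finish sooner.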

jreback added the Performance and Strings labels Dec 12, 2016
jreback added this to the 0.20.0 milestone Dec 12, 2016
jreback (Contributor Author) commented Dec 12, 2016

cc @mrocklin


        # match
        self.uniques = tm.makeStringIndex(1000).values
        self.all = self.uniques.repeat(10)

    def time_factorize_int(self):
        self.strings.factorize()

jreback (Contributor Author) commented:

should be factorize_string

@@ -110,7 +110,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~


- increased performance of ``pd.factorize()`` by releasing the GIL with ``object`` dtype when inferred as strings (:issue:``)


jreback (Contributor Author) commented:

add number

@@ -1,3 +1,5 @@
# cython: profile=True

jreback (Contributor Author) commented:

change back to False

@@ -94,6 +99,61 @@ cdef class {{name}}Vector:

{{endfor}}

cdef class StringVector:

jreback (Contributor Author) commented:

not actually using this, so add a comment or delete it

codecov-io commented Dec 12, 2016

Current coverage is 85.31% (diff: 100%)

No coverage report found for master at 033d345.

Powered by Codecov. Last update 033d345...ade23d1

jreback (Contributor Author) commented Dec 14, 2016

cc @mrocklin

When you have a chance, could you give this a stress test and see whether it helps in your perf tests?

mrocklin (Contributor) commented Dec 14, 2016 via email

jreback force-pushed the string branch 2 times, most recently from a8a01d6 to ade23d1, December 15, 2016 11:34
allows releasing the GIL on these dtypes

xref pandas-dev#13745
jreback closed this in 3ba2cff Dec 15, 2016
ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
xref pandas-dev#13745

provides a modest speedup for all string hashing. The
key thing is, it will release the GIL on more operations where this is
possible (mainly factorize).
can be easily extended to value_counts() and .duplicated() (for strings)

Author: Jeff Reback

Closes pandas-dev#14859 from jreback/string and squashes the following commits:

98f46c2 [Jeff Reback] PERF: use StringHashTable for strings in factorizing