-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: use StringHasTable for strings #14859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @mrocklin |
|
||
# match | ||
self.uniques = tm.makeStringIndex(1000).values | ||
self.all = self.uniques.repeat(10) | ||
|
||
def time_factorize_int(self): | ||
self.strings.factorize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be factorize_string
@@ -110,7 +110,7 @@ Removal of prior version deprecations/changes | |||
Performance Improvements | |||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
|
|||
- increased performance of ``pd.factorize()`` by releasing the GIL with ``object`` dtype when inferred as strings (:issue:``) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add number
@@ -1,3 +1,5 @@ | |||
# cython: profile=True | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change back to False
@@ -94,6 +99,61 @@ cdef class {{name}}Vector: | |||
|
|||
{{endfor}} | |||
|
|||
cdef class StringVector: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not actually using this so add comment /
delete
Current coverage is 85.31% (diff: 100%)
|
cc @mrocklin if you can give this a stress test (and see if it helps in your perf tests), when you have a chance |
I *think* I've narrowed down my original problem to something else.
Working on isolating it further now. Will try this after that issue is
removed.
…On Wed, Dec 14, 2016 at 11:05 AM, Jeff Reback ***@***.***> wrote:
cc @mrocklin <https://github.com/mrocklin>
if you can give this a stress test (and see if it helps in your perf
tests), when you have a chance
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#14859 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AASszImRcGD1PokmZSZMPJFRbLyO1ToCks5rIBPLgaJpZM4LKF2s>
.
|
a8a01d6
to
ade23d1
Compare
allows releasing the GIL on these dtypes xref pandas-dev#13745
xref pandas-dev#13745 provides a modest speedup for all string hashing. The key thing is, it will release the GIL on more operations where this is possible (mainly factorize). can be easily extended to value_counts() and .duplicated() (for strings) Author: Jeff Reback <[email protected]> Closes pandas-dev#14859 from jreback/string and squashes the following commits: 98f46c2 [Jeff Reback] PERF: use StringHashTable for strings in factorizing
xref #13745
provides a modest speedup for all string hashing. The key thing is, it will release the GIL on more operations where this is possible (mainly
factorize
).can be easily extended to
.value_counts()
and.duplicated()
(for strings), new issue for that one