-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Use String HashTable in value_counts #52113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ng-hashtable-algos
…ng-hashtable-algos
('uint16', 'UInt16', 'uint16', 'uint16', 'uint16_t', ''), | ||
('uint8', 'UInt8', 'uint8', 'uint8', 'uint8_t', ''), | ||
('object', 'Object', 'object', 'pymap', 'object', ''), | ||
('string', 'Object', 'object', 'str', 'char *', 'util.get_c_string'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can make this const char *
{{if dtype == 'object'}} | ||
k = kh_get_{{ttype}}(table, result_keys.data[i]) | ||
val = <object>result_keys.data[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did the object branch need to change?
@@ -53,9 +53,9 @@ cdef extern from "khash_python.h": | |||
kh_pymap_t* kh_init_pymap() | |||
void kh_destroy_pymap(kh_pymap_t*) | |||
void kh_clear_pymap(kh_pymap_t*) | |||
khuint_t kh_get_pymap(kh_pymap_t*, PyObject*) | |||
khuint_t kh_get_pymap(kh_pymap_t*, object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just a cleanup or did it actually change the behavior in some way? I have a slight preference for the extern declarations to match exactly the way they are declared in the source headers to minimize having to interpret what is happening; at the end of the day the linker isn't going to type check just resolve the symbols
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@@ -283,6 +299,8 @@ ctypedef fused htfunc_t: | |||
|
|||
cpdef value_count(ndarray[htfunc_t] values, bint dropna, const uint8_t[:] mask=None): | |||
if htfunc_t is object: | |||
if lib.is_string_array(values, skipna=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a viable way to avoid needing lib
here? this file is relatively low-dependency and lib is the opposite
Closing to clear the queue, but feel free to reopen when you have time to revisit |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Roughly around 20-40% speedup in the benchmarkI added for largish 1000/10000 arrays. Benchmarks are pretty flaky, though.
duplicated/ismember didn't show improvement (actually regressed). I guess we aren't using the hashtable enough there to make up for the cost of doing
is_string_array
.