Skip to content

PERF: Use String HashTable in value_counts #52113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

lithomas1
Copy link
Member

Roughly around 20-40% speedup in the benchmarkI added for largish 1000/10000 arrays. Benchmarks are pretty flaky, though.

duplicated/ismember didn't show improvement (actually regressed). I guess we aren't using the hashtable enough there to make up for the cost of doing is_string_array.

@lithomas1 lithomas1 added Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 22, 2023
('uint16', 'UInt16', 'uint16', 'uint16', 'uint16_t', ''),
('uint8', 'UInt8', 'uint8', 'uint8', 'uint8_t', ''),
('object', 'Object', 'object', 'pymap', 'object', ''),
('string', 'Object', 'object', 'str', 'char *', 'util.get_c_string'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can make this const char *

{{if dtype == 'object'}}
k = kh_get_{{ttype}}(table, result_keys.data[i])
val = <object>result_keys.data[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did the object branch need to change?

@@ -53,9 +53,9 @@ cdef extern from "khash_python.h":
kh_pymap_t* kh_init_pymap()
void kh_destroy_pymap(kh_pymap_t*)
void kh_clear_pymap(kh_pymap_t*)
khuint_t kh_get_pymap(kh_pymap_t*, PyObject*)
khuint_t kh_get_pymap(kh_pymap_t*, object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a cleanup or did it actually change the behavior in some way? I have a slight preference for the extern declarations to match exactly the way they are declared in the source headers to minimize having to interpret what is happening; at the end of the day the linker isn't going to type check just resolve the symbols

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 22, 2023
@@ -283,6 +299,8 @@ ctypedef fused htfunc_t:

cpdef value_count(ndarray[htfunc_t] values, bint dropna, const uint8_t[:] mask=None):
if htfunc_t is object:
if lib.is_string_array(values, skipna=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a viable way to avoid needing lib here? this file is relatively low-dependency and lib is the opposite

@mroeschke
Copy link
Member

Closing to clear the queue, but feel free to reopen when you have time to revisit

@mroeschke mroeschke closed this Aug 1, 2023
@mroeschke mroeschke added the Mothballed Temporarily-closed PR the author plans to return to label Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Mothballed Temporarily-closed PR the author plans to return to Performance Memory or execution speed performance Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants