PERF: Improve performance of hash sets in read_csv #25804

Merged
merged 11 commits into from
Mar 22, 2019
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -176,6 +176,7 @@ Performance Improvements
int8/int16/int32 and the searched key is within the integer bounds for the dtype (:issue:`22034`)
- Improved performance of :meth:`pandas.core.groupby.GroupBy.quantile` (:issue:`20405`)
- Improved performance of :meth:`read_csv` by faster tokenizing and faster parsing of small float numbers (:issue:`25784`)
- Improved performance of :meth:`read_csv` by faster parsing of N/A and boolean values (:issue:`25804`)

.. _whatsnew_0250.bug_fixes:

11 changes: 11 additions & 0 deletions pandas/_libs/khash.pxd
@@ -56,6 +56,17 @@ cdef extern from "khash_python.h":

    bint kh_exist_str(kh_str_t*, khiter_t) nogil

    ctypedef struct kh_str_starts_t:
        kh_str_t *table
        int starts[256]

    kh_str_starts_t* kh_init_str_starts() nogil
    khint_t kh_put_str_starts_item(kh_str_starts_t* table, char* key,
                                   int* ret) nogil
    khint_t kh_get_str_starts_item(kh_str_starts_t* table, char* key) nogil
    void kh_destroy_str_starts(kh_str_starts_t*) nogil
    void kh_resize_str_starts(kh_str_starts_t*, khint_t) nogil

    ctypedef struct kh_int64_t:
        khint_t n_buckets, size, n_occupied, upper_bound
        uint32_t *flags
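
The `kh_str_starts_t` declaration pairs a string hash table with a 256-entry `starts` array. The diff shows only the declarations, so the following C sketch is an assumption about the idea such a layout suggests, not the code from `khash_python.h`: record which first bytes occur among the stored keys, and reject a lookup on that byte before doing any hashing. The names `str_starts_set`, `set_init`, `set_put` and `set_contains` are hypothetical, and a small linear array stands in for the real `kh_str_t` table so the sketch stays self-contained.

```c
#include <string.h>

#define MAX_KEYS 16

/* Hypothetical stand-in for kh_str_starts_t: a plain array replaces the
   khash string table; starts[] mirrors the field of the same name. */
typedef struct {
    const char *keys[MAX_KEYS];
    int n_keys;
    int starts[256];  /* starts[c] != 0 when some stored key begins with byte c */
} str_starts_set;

/* Reset the set to empty: no keys, no first bytes recorded. */
static void set_init(str_starts_set *s) {
    memset(s, 0, sizeof *s);
}

/* Return 1 if key is stored, 0 otherwise. The first-byte check rejects
   most misses before any string comparison (or, in a real table, hashing). */
static int set_contains(const str_starts_set *s, const char *key) {
    unsigned char first = (unsigned char)key[0];
    if (!s->starts[first]) {
        return 0;  /* no stored key starts with this byte */
    }
    for (int i = 0; i < s->n_keys; i++) {
        if (strcmp(s->keys[i], key) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Store key and record its first byte. Returns 1 on success (or if the
   key is already present), 0 if the set is full. The caller keeps
   ownership of the string, which must outlive the set. */
static int set_put(str_starts_set *s, const char *key) {
    if (set_contains(s, key)) {
        return 1;
    }
    if (s->n_keys == MAX_KEYS) {
        return 0;
    }
    s->keys[s->n_keys++] = key;
    s->starts[(unsigned char)key[0]] = 1;
    return 1;
}
```

Under this assumption, a set holding N/A markers such as `NaN` and `NULL` only pays for a full lookup when a field starts with `N`; a numeric field such as `3.14` is rejected by a single array read.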