PERF: Improve performance of hash sets in read_csv #25804
Conversation
…ts_t khash struct
Codecov Report
@@ Coverage Diff @@
## master #25804 +/- ##
==========================================
+ Coverage 91.27% 91.27% +<.01%
==========================================
Files 173 173
Lines 53002 53002
==========================================
+ Hits 48375 48376 +1
+ Misses 4627 4626 -1
Codecov Report
@@ Coverage Diff @@
## master #25804 +/- ##
=======================================
Coverage 91.27% 91.27%
=======================================
Files 173 173
Lines 53002 53002
=======================================
Hits 48375 48375
Misses 4627 4627
Results of running
More precise benchmarking:
good stuff! looks good to me
@@ -71,9 +73,6 @@ cdef:
    float64_t NEGINF = -INF


cdef extern from "errno.h":
Pulling out errno ideally could have been a separate PR, but I'm fine doing it here.
# Resize the hash table to make it almost empty, this
# reduces amount of hash collisions on lookup thus
# "key not in table" case is faster.
# Note that this trades table memory footprint for lookup speed.
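For context, here is a rough sketch of the pattern this comment describes, written against klib's generic khash string-set macros rather than the pandas-specific kh_str_starts_t type; the function name make_sparse_str_set and the over-allocation factor 8 are illustrative assumptions, not the exact values used in the PR.

```c
/* Sketch: build a small string set, then resize it well past its element
 * count so that a lookup of an absent key usually hits an empty bucket and
 * returns without any string comparison. */
#include "khash.h"                 /* assumes klib's khash.h is on the include path */

KHASH_SET_INIT_STR(strset)         /* declares kh_strset_t, kh_init_strset, ... */

static kh_strset_t *make_sparse_str_set(const char *const *values, size_t n)
{
    int ret;
    kh_strset_t *table = kh_init_strset();
    for (size_t i = 0; i < n; i++)
        kh_put_strset(table, values[i], &ret);   /* insert, ignoring duplicates */
    /* Over-allocate buckets relative to n to keep the load factor low;
     * khash rounds the requested size up to a power of two. */
    kh_resize_strset(table, (khint_t)(n * 8));
    return table;
}
```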
Is there a ballpark estimate for how much more memory this might be using? I feel that, if anything, we get more issues for read_csv about high memory usage than about slow parsing, so I just want to be careful about the tradeoffs we make, especially since we don't have any memory benchmarks set up.
Worst case would be when bucket_size was increased from 128 to 1024: memory for a 128-sized table (on x86_64) is ~3.6 KB, for a 1024-sized table ~21.04 KB. For the default case the bucket size is 32, so memory increases from 1.67 KB to 6.04 KB.
I think it's acceptable to trade about 15 more KB (as there are typically three such maps: NA values, true-ish values, false-ish values) to gain up to 15% of parsing speed.
P.S. The memory size formula, where B is bucket_size:
sizeof(khint32_t) * 4 + sizeof(void*) * 3 + B * (sizeof(khint32_t) + sizeof(char*) + sizeof(size_t)) + 256 * sizeof(int)
which gives 1064 + B * 20 bytes (again on x86_64).
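A quick way to sanity-check those numbers is to evaluate the formula directly. A minimal sketch, assuming x86_64 type sizes; the function name is hypothetical and only mirrors the arithmetic above, not the actual pandas struct layout:

```c
#include <stdio.h>
#include <stdint.h>

/* Evaluate the footprint formula from the comment above for a given bucket size. */
static size_t str_starts_table_bytes(size_t bucket_size)
{
    const size_t khint32_sz = sizeof(uint32_t);          /* khint32_t */
    return khint32_sz * 4 + sizeof(void *) * 3
         + bucket_size * (khint32_sz + sizeof(char *) + sizeof(size_t))
         + 256 * sizeof(int);
}

int main(void)
{
    const size_t sizes[] = {32, 128, 256, 1024};
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        printf("bucket_size=%4zu -> %zu bytes\n",
               sizes[i], str_starts_table_bytes(sizes[i]));
    return 0;   /* prints 1704, 3624, 6184 and 21544 bytes on x86_64 */
}
```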
My math might be fuzzy, but assuming someone is parsing, say, 10M records and this increases the size of each record by 15 KB, would that require roughly 150 MB more memory?
It is 15 KB per parser, not per record :)
So in a typical case (when one parser exists at a time) it's only 15 KB more.
Ha, thanks for clarifying. I suppose that's quite the difference.
thanks @vnlitvin
@vnlitvin: Nice work!
@vnlitvin getting tons of build warnings like https://travis-ci.org/pandas-dev/pandas/jobs/511269649; see if you can do anything about this.
git diff upstream/master -u -- "*.py" | flake8 --diff
Improved performance of checking string presence in hash sets.
The hash bucket increase is needed due to the khash search algorithm: it computes a hash of the key being searched, uses its last N bits, where N corresponds to the bucket size, and then, if a collision happened, does an open-ended search over the table, calling a comparison function on keys. When the keys are strings (or, more precisely, char* values) the comparison function is strcmp(), whose cost is linear in the string length, so we want collisions to be rare.
Most lookups in these sets happen when values are checked against the N/A table, which by default has 17 entries. Without this change the bucket size is 32, so the hash collision probability is 17/32 (~53%); after the patch the bucket size is 256 and the collision probability is ~6.6%.
I will add asv results and whatsnew later.
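To make the fast "miss" path concrete, here is an illustrative sketch of that lookup pattern with simplified linear probing; the type and function names are assumptions, and the real khash implementation differs in its probing and bookkeeping details.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t n_buckets;       /* power of two: 32 before the patch, 256 after */
    const char **keys;      /* NULL marks an empty slot; table is kept sparse */
} str_set_sketch;

static size_t str_hash(const char *s)          /* any string hash works here */
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static bool set_contains(const str_set_sketch *set, const char *key)
{
    const size_t mask = set->n_buckets - 1;
    size_t i = str_hash(key) & mask;            /* "last N bits" of the hash */
    while (set->keys[i] != NULL) {              /* open-ended probing */
        if (strcmp(set->keys[i], key) == 0)     /* cost grows with key length */
            return true;
        i = (i + 1) & mask;
    }
    return false;        /* reached an empty slot: key is not in the set */
}
```

With 17 of 32 buckets occupied, a value that is not in the set usually lands on a filled slot and pays at least one strcmp() before reaching an empty one; with 256 buckets the first probe is almost always empty, so the miss returns without any string comparison.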