PERF: Improve performance of hash sets in read_csv #25804


Merged
11 commits merged into pandas-dev:master from khash-na-sets on Mar 22, 2019

Conversation

@vnlitvinov (Contributor) commented Mar 20, 2019

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Improved the performance of checking string presence in hash sets by:

  1. implementing a lookup table for the first symbol of each entry, which speeds up negative answers (see the sketch after this description)
  2. increasing the size of the hash table to reduce the probability of hash collisions

The larger buckets are needed because of the khash search algorithm: it computes a hash of the key being searched, uses its last N bits (where N corresponds to the bucket size) to pick a bucket, and, if a collision happened, does an open-ended probe over the table, calling a comparison function on candidate keys. When the keys are strings (or, more precisely, char* values), the comparison function is strcmp(), whose complexity is O(length), so we want collisions to be rare.
Most lookups in these sets happen when values are checked against the N/A table, which by default has 17 entries. Without the change the bucket size is 32, so the probability that a lookup hashes into an occupied bucket is 17/32 (~53%); after the patch the bucket size is 256 and the collision probability drops to 17/256 (~6.6%).
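
To illustrate item 1, here is a minimal, self-contained sketch of the first-symbol filter in C. It is not pandas' actual code: the real implementation wraps a khash string set in kh_str_starts_t (its signature is visible in the build log further down this thread), and the toy linear scan below merely stands in for the khash probe.

#include <string.h>

/* A 256-entry table records which first bytes occur among the stored keys,
 * so a lookup whose first byte was never seen is rejected without hashing
 * the key or calling strcmp() at all. */
typedef struct {
    const char *keys[32];       /* toy storage; the PR keeps a khash set here */
    int n_keys;
    unsigned char starts[256];  /* starts[c] != 0 iff some key begins with byte c */
} str_starts_sketch_t;

static void sketch_add(str_starts_sketch_t *t, const char *key) {
    t->keys[t->n_keys++] = key;
    t->starts[(unsigned char)key[0]] = 1;
}

static int sketch_contains(const str_starts_sketch_t *t, const char *key) {
    if (!t->starts[(unsigned char)key[0]])
        return 0;                        /* fast negative answer, the common case */
    for (int i = 0; i < t->n_keys; i++)  /* stand-in for the khash lookup */
        if (strcmp(t->keys[i], key) == 0)
            return 1;
    return 0;
}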

I will add asv results and whatsnew later.

@vnlitvinov changed the title from "Khash na sets" to "Improve performance of hash sets in read_csv" Mar 20, 2019
codecov bot commented Mar 20, 2019

Codecov Report

Merging #25804 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25804      +/-   ##
==========================================
+ Coverage   91.27%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       53002    53002              
==========================================
+ Hits        48375    48376       +1     
+ Misses       4627     4626       -1
Flag        Coverage Δ
#multiple   89.83% <ø> (ø) ⬆️
#single     41.76% <ø> (-0.01%) ⬇️

Impacted Files           Coverage Δ
pandas/util/testing.py   89.4% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85c3f82...da9d4e6. Read the comment docs.


codecov bot commented Mar 20, 2019

Codecov Report

Merging #25804 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #25804   +/-   ##
=======================================
  Coverage   91.27%   91.27%           
=======================================
  Files         173      173           
  Lines       53002    53002           
=======================================
  Hits        48375    48375           
  Misses       4627     4627
Flag        Coverage Δ
#multiple   89.83% <ø> (ø) ⬆️
#single     41.77% <ø> (ø) ⬆️

Impacted Files             Coverage Δ
pandas/io/excel/_base.py   92.48% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85c3f82...2ceb850. Read the comment docs.

@vnlitvinov (Contributor, Author) commented:

Results of running asv continuous -f 1.05 upstream/master HEAD -b io.csv -a sample_time=1 -a warmup_time=1:

before         after            ratio  test name
[fbe2523]      [da9d4e6]
(master)       (khash-na-sets)
1.34±0.02ms 1.26±0.01ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
13.7±0.2ms 12.8±0.3ms 0.94 io.csv.ReadCSVSkipRows.time_skipprows(None)
1.41±0.01ms 1.31±0.02ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
1.42±0.01ms 1.32±0.01ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
32.9±1ms 28.8±0.2ms 0.87 io.csv.ReadCSVCategorical.time_convert_direct
12.7±0.2ms 10.9±0.1ms 0.86 io.csv.ReadCSVThousands.time_thousands(',', ',')
13.2±0.08ms 11.3±0.1ms 0.86 io.csv.ReadCSVThousands.time_thousands('|', ',')
10.7±0.07ms 9.06±0.09ms 0.84 io.csv.ReadCSVThousands.time_thousands('|', None)
10.9±0.1ms 9.17±0.2ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', None)

@vnlitvinov changed the title from "Improve performance of hash sets in read_csv" to "PERF: Improve performance of hash sets in read_csv" Mar 20, 2019
@vnlitvinov (Contributor, Author) commented:

More precise benchmarking:
asv continuous -f 1.01 upstream/master HEAD -b io.csv -a sample_time=2 -a warmup_time=2

before         after            ratio  test name
[fbe2523]      [da9d4e6]
(master)       (khash-na-sets)
1.55±0.02ms 1.49±0ms 0.96 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
1.55±0.01ms 1.48±0.01ms 0.96 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
13.2±0.4ms 12.5±0.1ms 0.95 io.csv.ReadCSVSkipRows.time_skipprows(None)
1.56±0.02ms 1.48±0.01ms 0.95 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
1.84±0.02ms 1.74±0.01ms 0.95 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
1.81±0.02ms 1.72±0.01ms 0.95 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
1.56±0.02ms 1.47±0.01ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
1.84±0.01ms 1.73±0.02ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
1.57±0.03ms 1.47±0.03ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
1.34±0.01ms 1.24±0.03ms 0.92 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
1.35±0.02ms 1.24±0.01ms 0.92 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
1.43±0.02ms 1.30±0.01ms 0.91 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
1.43±0.03ms 1.29±0.01ms 0.90 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
31.5±0.6ms 28.3±0.3ms 0.90 io.csv.ReadCSVCategorical.time_convert_direct
2.95±0.07ms 2.61±0.06ms 0.89 io.csv.ReadUint64Integers.time_read_uint64
13.1±0.1ms 11.2±0.05ms 0.86 io.csv.ReadCSVThousands.time_thousands('|', ',')
10.8±0.04ms 9.12±0.08ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', None)
12.8±0.04ms 10.8±0.1ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', ',')
11.1±0.2ms 9.11±0.05ms 0.82 io.csv.ReadCSVThousands.time_thousands('|', None)

@jreback added the Performance (Memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Mar 22, 2019
@jreback (Contributor) commented Mar 22, 2019

lgtm. @chris-b1 @gfyoung

@jreback added this to the 0.25.0 milestone Mar 22, 2019
@chris-b1 (Contributor) left a comment

good stuff! looks good to me

@@ -71,9 +73,6 @@ cdef:
float64_t NEGINF = -INF


cdef extern from "errno.h":
Contributor:

Pulling out errno could ideally have been a separate PR, but I'm fine doing it here.

# Resize the hash table to make it almost empty; this
# reduces the number of hash collisions on lookup, so the
# "key not in table" case is faster.
# Note that this trades table memory footprint for lookup speed.
Member:

Is there a ballpark estimate for how much more memory this might use? If anything, we get more read_csv issues about high memory usage than about slow parsing, so I just want to be careful about the tradeoffs we make, especially since we don't have any memory benchmarks set up.

@vnlitvinov (Contributor, Author) commented Mar 22, 2019:

The worst case is when bucket_size is increased from 128 to 1024: a 128-slot table takes ~3.6 KB (on x86_64), while a 1024-slot table takes ~21.04 KB.

In the default case the bucket size grows from 32 to 256, so memory increases from 1.67 KB to 6.04 KB.

I think it's acceptable to trade about 15 more KB (as there are typically three such maps: N/A values, true-ish values, false-ish values) for up to 15% more parsing speed.

P.S. The memory size formula, where B is bucket_size:

sizeof(khint32_t) * 4 + sizeof(void*) * 3 + B * (sizeof(khint32_t) + sizeof(char*) + sizeof(size_t)) + 256 * sizeof(int)

which gives 1064 + B * 20 bytes (again on x86_64).
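
As a sanity check, the formula can be evaluated directly. A small standalone C program (assuming x86_64 sizes as above; uint32_t stands in for khash's khint32_t):

#include <stdio.h>
#include <stdint.h>

/* Footprint of one string hash set, per the formula above (B = bucket_size). */
static size_t table_bytes(size_t B) {
    return sizeof(uint32_t) * 4 + sizeof(void *) * 3
         + B * (sizeof(uint32_t) + sizeof(char *) + sizeof(size_t))
         + 256 * sizeof(int);            /* the first-symbol lookup table */
}

int main(void) {
    printf("B=32:   %zu bytes\n", table_bytes(32));    /* 1704  (~1.67 KB)  */
    printf("B=256:  %zu bytes\n", table_bytes(256));   /* 6184  (~6.04 KB)  */
    printf("B=128:  %zu bytes\n", table_bytes(128));   /* 3624  (~3.6 KB)   */
    printf("B=1024: %zu bytes\n", table_bytes(1024));  /* 21544 (~21.04 KB) */
    return 0;
}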

Member:

My math might be fuzzy, but assuming someone is parsing, say, 10M records and this increases the size of each record by 15 KB, would that require roughly 150 MB more memory?

@vnlitvinov (Contributor, Author):

It is 15 KB per parser, not per record :)
So in a typical case (when one parser exists at a time) it's only 15 KB more.

Member:

Ha, thanks for clarifying. I suppose that's quite the difference.

@jreback jreback merged commit cd057c7 into pandas-dev:master Mar 22, 2019
@jreback (Contributor) commented Mar 22, 2019

thanks @vnlitvinov

@gfyoung (Member) commented Mar 22, 2019

@vnlitvinov: Nice work!

@vnlitvinov deleted the khash-na-sets branch Mar 25, 2019
@jreback (Contributor) commented Mar 27, 2019

@vnlitvinov we are getting tons of build warnings, like https://travis-ci.org/pandas-dev/pandas/jobs/511269649

Please see if you can do anything about this.

In file included from pandas/_libs/parsers.c:663:0:
pandas/_libs/src/klib/khash_python.h:109:27: note: expected ‘struct kh_str_starts_t *’ but argument is of type ‘const struct kh_str_starts_t *’
 khint_t PANDAS_INLINE kh_get_str_starts_item(kh_str_starts_t* table, char* key) {
                           ^
pandas/_libs/parsers.c:26118:7: warning: passing argument 2 of ‘kh_get_str_starts_item’ discards ‘const’ qualifier from pointer target type [enabled by default]
       __pyx_t_1 = (kh_get_str_starts_item(__pyx_v_na_hashset, __pyx_v_word) != 0);
       ^
In file included from pandas/_libs/parsers.c:663:0:
pandas/_libs/src/klib/khash_python.h:109:27: note: expected ‘char *’ but argument is of type ‘const char *’
 khint_t PANDAS_INLINE kh_get_str_starts_item(kh_str_starts_t* table, char* key) {
                           ^
pandas/_libs/parsers.c: In function ‘__pyx_f_6pandas_5_libs_7parsers__try_bool_flex_nogil’:
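
The compiler notes point at the likely fix: the generated parsers.c passes const-qualified pointers, while the helper in khash_python.h takes non-const ones. A plausible const-correct declaration (my assumption, not something settled in this thread) would be:

/* khash_python.h currently declares (per the log above):
 *   khint_t PANDAS_INLINE kh_get_str_starts_item(kh_str_starts_t* table, char* key);
 * Lookup never mutates the table or the key, so marking both const would
 * let the generated code pass const pointers without warnings: */
khint_t PANDAS_INLINE kh_get_str_starts_item(const kh_str_starts_t* table,
                                             const char* key);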

anmyachev pushed a commit to anmyachev/pandas that referenced this pull request Apr 18, 2019